1.0 INTRODUCTION


The CARE III (Computer-Aided Reliability Estimation, 
version three) computer program is being developed as a 
general-purpose reliability estimation tool for fault-tolerant 
avionics systems. The first CARE Program, developed at the 
Jet Propulsion Laboratory in 1971, provided an aid for esti- 
mating the reliability of systems consisting of a combination 
of any of several standard configurations (e.g., standby-replacement
configurations, triple-modular redundant configurations, etc.).
Non-unity dormancy factors were allowed as well as user-supplied
non-unity coverage probabilities.

CARE II was subsequently developed by Raytheon, under 
contract to the NASA Langley Research Center, in 1974. It, 
like the original CARE, was based on a combinatorial reli- 
ability model. The model in this case, however, was consider- 
ably more versatile. 

A simple mathematical expression was used to evaluate 
the reliability of any redundant configuration over any 
interval during which the failure rates and coverage parameters 
remained unaffected by configuration changes. In addition, 
provision was made for convolving such expressions in order to 
evaluate the reliability of a "dual-mode" system; i.e., a 
system in which a single coverage-parameter/failure-rate con- 
figuration change was allowed during the interval of interest. 

A coverage model was also developed to determine the various 
relevant coverage coefficients as a function of the available 
hardware and software fault detector characteristics (detec- 
tion delay, scheduling interval, etc.), and the subsequent 
isolation and recovery delay statistics.

CARE II suffers from two limitations that make it
difficult to use as a general-purpose reliability estimation 
tool for avionics systems. The most serious of these limi- 
tations is its two-mode restriction. In many avionics system 
configurations, each new failure precipitates a mode change 
(i.e., a failure rate or coverage coefficient change). Con- 
sequently, many operating modes are possible. While CARE II 
could be modified to allow this possibility, the resulting 
program would be cumbersome and the computer run-time excessive.

A second limitation in CARE II is the lack of a mechanism 
for specifying multiple success criteria; i.e., for allowing 
the user to indicate that there are several operational system 
configurations, as is almost always the case in avionics sys- 
tems. Although this latter limitation could be easily remedied 
within the CARE II structure, the former could not. Accordingly,
it was decided to develop a more general reliability estimation
computer program specifically designed to overcome these
limitations. The present report summarizes the accomplishments 
made during the first phase of this two-phase effort. 

Three tasks were emphasized during phase one: requirements
assessment; definition of program structure; development of
the reliability model. The remaining work needed to complete
the objectives of the CARE III program will be accomplished
during phase two; viz., adaptation of the CARE II coverage model
to satisfy CARE III requirements; development of a user interface
for system configuration and success criteria specification;
integration of the various program modules into a unified
program structure.

The structure postulated for the CARE III program is
described in section 4. In brief, the program will consist
of three independent modules. CAREIN interprets user inputs
defining the system structure, the system success criteria,
the various fault models and coverage parameters, and generates
files to be used by COVRGE and CARE3. COVRGE then translates
these specifications into the coverage parameters associated
with each of the various system stages and operating modes.
The third program module, CARE3, operates on files generated
by both CAREIN and COVRGE to produce system reliability
estimates in accordance with the user-defined success criteria.

The major effort during phase one was devoted to developing
and programming the reliability model to be implemented in
CARE3. The results of this effort are described in detail in
section 3. The selected mathematical model is based on
Kolmogorov's forward equations. In a parallel effort, a
detailed examination was made into techniques for obtaining
solutions to multi-state Markov models. The initial impetus
for this work was to develop an alternative model for CARE3
should the Kolmogorov method run into computational difficulties.
The Kolmogorov method, however, proved to be highly effective
for the class of structures of concern here, overcoming
most of the limitations (e.g., extremely large number of
states, time-invariant transition rates) associated with
time-homogeneous Markov models. Nevertheless, the Markov
investigation was continued when it became apparent that these
techniques would be useful in determining coverage parameters
associated with intermittent faults. (An example of this is
presented in paragraph 3.3.) The results of this investigation
are described in an appendix to Volume II of this report.

The coverage model to be implemented in COVRGE will be an
extension of that implemented in CARE II (Ref. 1). This
coverage model has been modified to produce the (generally
time-varying) recovery rates, as required by CARE III,
rather than the recovery probabilities used in CARE II. The
model has not yet been integrated into CARE III, however,
nor has it been combined with intermittent fault models.
(The reliability model tests described in section 3 used
simplified coverage models involving either constant recovery
rates or fixed recovery delays.) Completion of the coverage
model and its integration into the CARE III structure is one
of the first tasks to be completed during phase 2.

The major task remaining to be accomplished during phase
2 is the development of CAREIN. The intent here is to provide
the user maximum flexibility in specifying the system structure,
fault models, coverage parameters, success criteria, etc., in
the simplest possible format. A general approach to this task
is outlined in section 4 and detailed in Volume II of this
report.

2.0 CARE III REQUIREMENTS ASSESSMENT


Four fault-tolerant systems were examined in an effort to 
characterize the class of structures CARE III will be expected 
to model and to estimate the kind and range of parameters 
needed to describe these structures. The four systems examined 
were: Boeing Aircraft Corporation's ARCS (Airborne Advanced
Reconfigurable Computer System, Ref. 2), SIFT (Software
Implemented Fault Tolerance Computer, Ref. 3) under development at
SRI International, FTMP (Fault-Tolerant Multi-Processor, Ref. 4)
under development at Charles Stark Draper Laboratory, and
FTSC (Fault-Tolerant Spacecraft Computer, Ref. 5) under
development at Raytheon. A study was made both of the structures
of these systems and of the techniques used to estimate their
reliability. The results of this study are briefly summarized
in paragraph 2.1. Paragraph 2.2 then lists the requirements
that were imposed on the CARE III reliability and coverage
models as a result of this study and due to other considerations.

2.1 SUMMARY OF FINDINGS

2.1.1 SIFT 

The SIFT computer system consists of a number of identical 
processors (containing both memory and processing elements) 
interconnected by several interprocessor buses.* The processors 
are dynamically assigned to various groups, with each group 
typically comprised of three processors, but in some cases as 
many as five. The loosely synchronized processors in each 
group perform the same operations on the same data and transmit
the results of these operations to each of the other processors
in their group. Each processor evaluates its own health and
that of the other processors in its group by comparing these
results. Faulty processors and buses are identified by
analyzing discrepancies in these results; reconfiguration takes
place whenever a majority of processors in a group concludes
that one of its elements is defective.

*The bus structure was changed subsequent to this investigation;
the change, however, does not modify the conclusions reached
here concerning CARE III requirements.

In CARE II terminology (Ref. 1), SIFT is comprised of two
stages:* a processor stage consisting of m processors, and a
bus stage comprising n buses. The system has failed by time t
if fewer than $m_0$ processors or fewer than $n_0$ buses are still
functioning, or if a coverage failure has occurred prior to
that time.

The reliability of SIFT was estimated in Ref. 3 by using
a continuous-time Markov model with time-independent transition
parameters. Coverage was taken into account by defining
a deterministic latency period $\tau_0$ between the occurrence of a
failure and its detection. If a second processor or bus fails
during this period, a system failure is declared. Since all
processors and buses are presumably always powered, the dormancy
factor is assumed to be unity.

*The term "stage" refers to an ensemble of identical,
interchangeable units.

Note that the probability of a coverage failure is a
function of the number of processors (buses) functioning at
the time of a processor (bus) failure. That is, the probability
of a second processor or bus failure, and hence a coverage
failure, during the $\tau_0$-second latency period depends upon the
number of processors or buses functioning at that time. Since
the possibility is (not unreasonably) ignored that two bus
(processor) failures occur between the time that a processor
(bus) failure occurs and the time that the failure is detected,
the coverage parameters associated with a processor (bus)
failure are independent of the number of buses (processors) in
operation at the time of the failure. Thus, the system can be
modeled as a two-stage configuration, a processor stage exhibiting
m-1 modes (corresponding to the different numbers of
processors that could be functioning at the time of a new
failure), and an (n-1)-mode bus stage. It is important to
emphasize that there is no coupling between the two stages; a
mode change in the processor stage does not result in a
bus-stage mode change, and vice versa. This simplifies the
reliability model since each stage can be treated independently.
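
The dependence of the coverage failure probability on the number of
functioning units can be illustrated with a minimal sketch. It assumes,
in keeping with the constant-rate Markov model described above, that
each unit fails independently at a constant rate; the function name and
the numerical values are illustrative assumptions, not values taken
from Ref. 3. If k like units remain in operation after a failure, the
probability that at least one of them fails within a fixed latency
period is 1 - exp(-k * rate * latency), which grows with k.

```python
import math

def coverage_failure_probability(remaining_units, failure_rate, latency):
    """Probability that at least one remaining unit fails during the latency period.

    Assumes independent units with a constant failure rate (an assumption
    made for this illustration).
    """
    return 1.0 - math.exp(-remaining_units * failure_rate * latency)

# Hypothetical values: failure rate per hour, latency of one second.
for remaining in (4, 3, 2):
    p = coverage_failure_probability(remaining, failure_rate=1e-4, latency=1.0 / 3600.0)
    print(remaining, p)
```

Each value of k thus corresponds to a different coverage parameter,
which is why each new failure in a stage defines a new operating mode.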

Transient faults in the SIFT model are, like permanent
faults, of two types: processor faults and bus faults, both
having time-independent rates of occurrence. Any transient
fault can have one of two outcomes: with probability $p_{tr}$ the
system recovers completely; with probability $1 - p_{tr}$ the system
loses the afflicted bus or processor. The following events
are not allowed: a transient fault occurring during a latent
permanent fault; a permanent fault occurring during a
still-active transient; a transient fault occurring while a previous
transient is still active.

2.1.2 FTMP 

The FTMP is comprised of a set of processors, a set of 
memories, and a set of buses over which processors and mem- 
ories can communicate. The processors, memories, and buses 
are each grouped into "triads." A processor triad consists
of three tightly coupled processors all committed to the same
task; a memory triad consists of three memory modules all
containing the same data; and a bus triad consists of three buses
with each bus used for transmission purposes by exactly one
of the three units comprising each processor or memory triad.
The system is thus partitioned, at any given time, into a
number of processor triads and a number of memory triads, with
all processor-memory communication taking place over a common
bus triad. Each processor-bus and each memory-bus interface
(bus guardian unit) contains a voter that produces as an output
the majority vote of the three inputs received over the bus
triad. Faulty processors, memories, or buses are identified
by diagnosing the pattern of discrepancies observed at these
voters.

Four different reliability models for the FTMP are
described in Ref. 4. The first involves a 146-state
discrete-time Markov model with time-invariant transition parameters.
The states are defined by the number of detected and undetected
faults in the processor modules, the memory modules, the bus
system and the bus guardian units. The Markov model was kept
to 145 states by identifying all system states involving more
than two undetected faults or more than three total faults with
the failed state. Other approximations were also made in order
to obtain tractable transition parameters. Even so, the computer
time needed to obtain numerical results using this model
was such that reliabilities were determined for only the first
second of FTMP operation.

To extend these results, a simplified 11-state Markov
model was obtained by treating modules having detected failures
as though they were again operational and by assuming any
combination of three or more faults causes a system failure.
Numerical reliability results were then obtained for the
first 40 seconds of FTMP operation using this model.

The reliability of the FTMP for longer durations was
estimated using a combinatorial model to determine the
probability that at least $P_0$ of P processors, $M_0$ of M memories,
and $B_0$ of B buses are operating at time t (assuming
perfect coverage) and by extrapolating the coverage failure
probabilities obtained using the 11-state Markov model.
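
A perfect-coverage combinatorial estimate of this kind can be sketched
as follows. The sketch assumes that the units within a stage are
independent and identical with constant failure rates and that the
stages fail independently of one another; the function names, unit
counts, and rates are hypothetical and are not taken from Ref. 4.

```python
from math import comb, exp

def at_least_k_of_n(k, n, p):
    """Probability that at least k of n independent units, each up with probability p, are up."""
    return sum(comb(n, j) * p**j * (1.0 - p)**(n - j) for j in range(k, n + 1))

def system_reliability(stages, t):
    """Product of per-stage k-of-n probabilities at time t (perfect coverage assumed).

    stages -- iterable of (required, total, failure_rate) tuples
    """
    r = 1.0
    for required, total, rate in stages:
        r *= at_least_k_of_n(required, total, exp(-rate * t))
    return r

# Hypothetical processor, memory, and bus stages: (required, total, rate per hour).
stages = [(3, 6, 1e-4), (3, 6, 5e-5), (1, 3, 1e-5)]
print(system_reliability(stages, t=10.0))
```

Coverage failures are not represented in this calculation; in the
approach described above they were accounted for separately by
extrapolation from the Markov model.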

In a later investigation, the 11-state Markov model was
modified to determine the effect of transient faults on the
FTMP for short (100-minute) missions. The permanent failure
states in the original model were replaced by intermittent
failure states in which failures healed (temporarily) at a
constant rate $\alpha$ and recurred at a constant rate $\delta$. (Once a
failure has occurred, it remains in the intermittent mode
either until it is detected or until it results in a system
failure.)

In all of these models, coverage was defined in terms 
of the probability that a second fault of a given type 
occurred during the exponentially distributed latency period 
of the fault in question. 
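
For an exponentially distributed latency period, this probability has
a simple closed form if the second fault is also assumed to arrive at
a constant rate (an assumption made here for illustration): the second
fault precedes detection with probability fault_rate / (fault_rate +
detection_rate). The sketch below evaluates this expression and
compares it with a small simulation; all names and values are
hypothetical.

```python
import random

def second_fault_probability(fault_rate, detection_rate):
    """Probability that a fault arriving at fault_rate precedes detection at detection_rate."""
    return fault_rate / (fault_rate + detection_rate)

def simulate(fault_rate, detection_rate, trials=100_000, seed=1):
    """Monte Carlo estimate of the same probability, for comparison."""
    rng = random.Random(seed)
    hits = sum(
        rng.expovariate(fault_rate) < rng.expovariate(detection_rate)
        for _ in range(trials)
    )
    return hits / trials

print(second_fault_probability(1.0, 9.0))  # closed form
print(simulate(1.0, 9.0))                  # simulation estimate
```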

In CARE II terminology, then, the FTMP model consists of
three stages: processor, memory and bus. There are as many
operating modes as there are modules, since the recovery
probability is a function of the number of previous failures
in each of the three stages. Thus, the three stages are
"coupled" in that the coverage associated with a fault in
stage i depends, in part, on the absence of faults in stage
$j \neq i$ during the latency period.

2.1.3 ARCS 

The ARCS system involves a computer stage (consisting of 
three or four identical computers), several sensor stages, and 
several servo (actuator) stages. The non-internally-redundant 
computers accept information from their associated sensors, 
interchange this information over cross-channel buses, and 
generate signals to their associated servo systems. The outputs
of the (generally three) servos comprising a given stage
are voted on by a mechanical voting mechanism assumed to have 
complete first-failure fault tolerance. 

The computers use a combination of hardware and software 
techniques to monitor their own performance and that of their 
associate computers, and to identify defective sensors and 
servos. Reconfigurations (following which, for example, a
servo is deactivated, or the outputs of some sensor or computer
are ignored) are effected through information passed back and
forth among the ARCS computers.

The ARCS system was modeled in Ref. 2 by breaking 
it up into stochastically independent stages and then repre- 
senting each stage with a continuous-time, constant-parameter 
Markov model of up to ten states. The coverages used in deriving
the Markov transition parameters were estimated, in
some cases, by testing actual devices using a randomly selected
subset of possible faults; in other cases, coverage probabilities
were simply postulated since no data were available.

The ARCS reliability model took into account the
peripheral devices (sensors and servos) as well as the central 
computer. The ARCS architecture is such that a failure in a 
redundant module in one of its stages may cause the function 
of a module in one or more of its other stages to be lost as 
well. Accordingly, provision was made whereby the user could 
specify a "dependency" relationship among the various stages 
of the ARCS configuration; i.e., the user could in effect 
specify more than one definition of an operational system 
configuration. 

Transient and intermittent faults were both taken into 
account in that they were allowed to influence the Markov 
transition parameters. Transients affected these parameters
to the extent that they were "leaky"; i.e., the permanent fault 
hazard rate was increased by a term reflecting the rate of 
occurrence of transients of duration exceeding some test 
interval T. Since the Markov model implemented in the ARCS 
reliability evaluation program allowed unidirectional transi- 
tions only, the effect of intermittent faults (causing 
transitions back and forth between two states) was approximated 
by calculating an "effective" unidirectional transition 
parameter from one of these states to the other. 
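
One way to read the "leaky" transient adjustment is sketched below. It
adds to the permanent hazard rate the arrival rate of transients
multiplied by the probability that a transient outlasts the test
interval T. The exponential distribution of transient durations is an
assumption introduced here for illustration (the source does not state
the distribution), and all names and values are hypothetical.

```python
import math

def effective_permanent_rate(permanent_rate, transient_rate, mean_duration, test_interval):
    """Permanent hazard rate plus the rate of transients lasting longer than test_interval.

    Transient durations are assumed exponential with the given mean
    (an assumption of this sketch, not of the ARCS model).
    """
    p_exceeds = math.exp(-test_interval / mean_duration)
    return permanent_rate + transient_rate * p_exceeds

# Hypothetical rates per hour; durations and test interval in hours.
print(effective_permanent_rate(1e-4, 1e-3, mean_duration=1e-4, test_interval=5e-4))
```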

2.1.4 FTSC 

The FTSC (Ref. 5) is an internally redundant central 
processor being developed for the U.S. Air Force. It is 
partitioned into nine types of elements (central processing 
unit, memory module, direct memory access unit, serial bus 
interface unit, power module, timing module, configuration 
control unit, circumvention unit, and hardened timer)
interconnected by seven different bus networks (address bus,
data bus, control bus, power bus, timing bus, interrupt bus,
status bus). Each of these elements and buses is provided
with redundant spares, in various configurations depending 
upon its complexity. (One element, the memory module, is 
itself internally redundant as well.) 

The current FTSC reliability model is a simplified, one- 
mode, sixteen-stage version of CARE II. In some cases, non- 
unity dormancy factors were used to account for the lower 
failure rate of inactive and unpowered modules. 

2.2 CARE III REQUIREMENTS

The emphasis in the previous section was on the techniques 
used to estimate the reliabilities of the systems in question. 

At a minimum, CARE III must provide a unified model for all 
four of those systems and hence reproduce, under the appropriate 
set of conditions, the results obtained using each of these 
models. This, of course, is a necessary but not a sufficient 
condition to place on CARE III. To be most useful, it must be 
flexible enough to overcome any limitations imposed by the 
above models (e.g., restrictive coverage models, limited fault 
models, etc.) and at the same time sufficiently general to 
allow other, as yet unspecified, fault-tolerant systems to be 
modeled without introducing artificial restrictions. The
following paragraphs outline the requirements imposed on 
CARE III and explain the rationale for each of these require- 
ments in terms of the above objectives. 

1. Capability of modeling up to at least 40 stages. 

Rationale: This is specified in the CARE III Statement of
Work. Although none of the systems considered in paragraph 2.1
require as many as 40 stages, it is not difficult to conceive
of systems that do. This requirement will be satisfied in
CARE III by providing a means for concatenating independent
runs. If the coupling between stages is limited,
it will in fact be possible to model an arbitrarily large
number of stages by making repeated runs.

2. Multiple operating modes for each set of coupled
stages.

Rationale: The operating mode of a system or subsystem
is, so far as its reliability model is concerned, a function
of its structure (number of units of various types that have
to be operational for the system to function as specified) and
its coverage parameters. If the system's structure or coverage
coefficients change stochastically during its operating lifetime
(e.g., if they depend upon the number of faults already
incurred), such changes must be reflected in its reliability
model. If a mode change in one stage precipitates a mode change
in some other stage, the two stages are said to be coupled.
(Deterministic structural or coverage parameter changes must,
of course, also be reflected in the reliability model. Such
changes are relatively easily accommodated, however, by
introducing time-dependent coverage parameters and by concatenating
reliability models representing the disjoint time intervals
during which the system structure is invariant. Thus, such
mode changes impose no new constraints provided only that the
coverage parameters are allowed to be time-dependent.)

CARE II allowed only one mode change (two operating modes);
the exhaustion of the spares available at any one stage could
cause the system to change from, say, a dual-redundant to a
single-string configuration, thereby changing both the system
structure and the coverage coefficients associated with each
stage. Two of the systems discussed in paragraph 2.1, however,
(SIFT and ARCS) exhibited mode changes after each new fault.
Thus, the two-mode limitation of CARE II is not acceptable for
CARE III.

3. Separate coverage model similar to that in CARE II 
but capable of handling latent and intermittent faults as well 
as permanent faults. 

Rationale: The major advantage in keeping the reliability
and coverage models distinct (as they were in CARE II) is
that it allows the user to concentrate on each of these two
areas relatively independently and hence simplifies the task
of defining the system model. In addition, there are some
significant practical advantages (cf. Section 4) in separating
the reliability model, driven by infrequently occurring
failures, from the coverage model reflecting the much more
rapid detection, isolation and recovery events.

The need to handle both intermittent and latent faults in
the coverage model is evident from the discussion in paragraph
2.1.


4. Multiple success criteria 

Rationale: As ARCS clearly demonstrates, some redundant
systems may be considered operational under any one of a number
of possible conditions. It is therefore necessary for the user
to be able to define each of those conditions and for CARE III
to calculate the probability that at least one of them occurs.
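
The calculation called for here can be illustrated with a small
sketch. It assumes perfect coverage, independent stages, and
independent identical units within each stage, and it represents each
success criterion as a tuple of minimum unit counts per stage; these
modeling choices and all names are assumptions made for illustration
and do not describe the CARE III user interface. Because the criteria
generally overlap, the probability that at least one is met is found
by summing over the joint unit counts rather than by adding the
individual probabilities.

```python
from itertools import product
from math import comb

def count_distribution(n, p):
    """Binomial distribution of the number of working units among n."""
    return [comb(n, k) * p**k * (1.0 - p)**(n - k) for k in range(n + 1)]

def probability_any_criterion(stage_sizes, unit_up_probs, criteria):
    """Probability that at least one criterion (minimum counts per stage) is met."""
    dists = [count_distribution(n, p) for n, p in zip(stage_sizes, unit_up_probs)]
    total = 0.0
    for counts in product(*(range(n + 1) for n in stage_sizes)):
        if any(all(c >= need for c, need in zip(counts, crit)) for crit in criteria):
            prob = 1.0
            for dist, c in zip(dists, counts):
                prob *= dist[c]
            total += prob
    return total

# Hypothetical two-stage system: operational with (2 of 3, 1 of 2) or (3 of 3, 0 of 2).
print(probability_any_criterion([3, 2], [0.5, 0.5], [(2, 1), (3, 0)]))
```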

5. n-point failure mechanisms ("category 3" faults) 

Rationale: Most fault-tolerant systems exhibit "n-point-failure"
mechanisms; i.e., sets of n failures (n > 1) that can
disable the system even though spare hardware is available.
If two bus guardian units (BGUs) fail in the enable mode in the
FTMP, for example, the system is potentially inoperative even
though spare operational modules are available. CARE II modeled
such failure mechanisms only for n = 1. Although the probability
of such failures is generally a rapidly decreasing function of n, it
cannot a priori be considered negligible for all n > 1. The
concept of a single-point failure must therefore be generalized
to take this into account.

6. Time-dependent hazard rates

Rationale: All of the reliability models considered in
paragraph 2.1 assumed constant hazard rates. There are at
least two reasons why it would be desirable to relax this
restriction: (1) Recent data indicate that at least in some
environments (space) the hazard rates are far from constant.
(2) The hazard rates associated with modules having internal
redundancy are not constant even if the individual component
hazard rates are.
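
The second point can be seen in the simplest internally redundant
module, taken here as a hypothetical example: two identical units in
parallel, each with a constant hazard rate, with the module surviving
as long as either unit survives. The module reliability is
R(t) = 1 - (1 - exp(-rate*t))**2, and its hazard rate -R'(t)/R(t)
starts at zero and rises toward the single-unit rate. The function
name and values below are illustrative.

```python
import math

def parallel_pair_hazard(rate, t):
    """Hazard rate of a module that survives while either of two identical units survives."""
    q = 1.0 - math.exp(-rate * t)                   # probability one unit has failed by t
    reliability = 1.0 - q * q                       # module fails only when both have failed
    density = 2.0 * rate * math.exp(-rate * t) * q  # -d(reliability)/dt
    return density / reliability

# Hypothetical component hazard rate of 1e-3 per hour.
for t in (0.0, 100.0, 1000.0, 10000.0):
    print(t, parallel_pair_hazard(1e-3, t))
```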

7. Transient faults 

Rationale: Most faults are modeled either as permanent
or intermittent, the latter actually being permanent faults
that manifest themselves intermittently. Some faults may 
well be transient in nature, however; e.g., faults due to 
noise or those due to improperly validated software. In such 
cases, no hardware damage has occurred and, as soon as the 
cause of the fault disappears, the system can, in principle, 
function as before. 

8. Non-unity dormancy factors 

Rationale: Of the four models discussed in paragraph
2.1, only the FTSC model allowed non-unity dormancy factors.
In some cases, it is reasonable to assume that dormant (e.g.,
unpowered or inactive) modules may have lower hazard rates
than active modules. Non-unity dormancy factors will be
defined as follows: Let P(t) be the probability that an
active unit survives until time t and let $P^a(t)$ be the
probability that a dormant unit survives until time t. The
exponent a < 1 is the dormancy factor.
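
Reading this definition as "the dormant-unit survival probability is
the active-unit survival probability raised to the power of the
dormancy factor," a brief sketch follows. The function name and the
numerical values are illustrative. For an exponentially distributed
lifetime, the definition is equivalent to multiplying the hazard rate
by the dormancy factor; more generally, since the hazard rate is the
negative derivative of the logarithm of the survival probability, the
dormant hazard rate is the dormancy factor times the active hazard
rate at every time t.

```python
import math

def dormant_survival(active_survival, dormancy_factor):
    """Survival probability of a dormant unit, given that of an active unit."""
    return active_survival ** dormancy_factor

# Hypothetical active unit: constant hazard rate 1e-4 per hour, evaluated at 1000 hours.
rate, t, factor = 1e-4, 1000.0, 0.1
print(dormant_survival(math.exp(-rate * t), factor))  # dormant unit
print(math.exp(-factor * rate * t))                   # same value via a scaled rate
```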
3.0 RELIABILITY MODEL DEVELOPMENT


Three basic mathematical approaches were considered
for development of the reliability model: (1) Extension of
the CARE II method. (2) Markov chain method. (3) A
recursion technique based on Kolmogorov's forward differential
equations.

The CARE II approach was rejected because of the large 
number of operational modes needed to model some of the fault- 
tolerant systems of interest. The coverage probabilities 
in both the SIFT and the FTMP systems are functions of the 
number of units still operating. Thus, each new failure 
effectively defines a new mode of operation. As demonstrated 
in the CARE II Final Report (Ref. 1), the complexity of the 
closed-form analytic expressions used in the CARE II model is a 
rapidly growing function of the number of possible operating 
modes. Even if transform techniques (e.g., Laplace transforms)
are used to eliminate the multiple integrals found in
these expressions, the model becomes intractable for systems 
involving more than four or five operating modes. 

Some effort was made to generalize the basic CARE II
equation (relating the probability of operating at time t
with exactly $\ell$ known failures to the failure rates, coverage
probabilities, number of active and spare elements, etc.) to
include the case in which the coverage parameters were
allowed to be functions of the number of previous failures
in the stage in question. This would have, in principle,
drastically reduced the number of required "system modes"
since a mode change would no longer necessarily be needed to
accommodate a change in the number of operating units in a
given stage. This effort was abandoned, however, when it
became apparent that the cross-coupling between stage
coverages (i.e., the dependence of the coverage in one
stage on conditions in another stage) could also be a significant
factor in some cases of interest.

The term "Markov chain" in the present context denotes
the following modeling structure: The system state at any
given instant is characterized by all those parameters needed
to determine both the likelihood that it will experience some
fault at time t and the probability that it will successfully
recover from that fault. These various system states are
then interrelated through a set of transition functions
representing the rates at which the system state changes from any
given state to any other state. (Thus, the transition functions
$r_{ij}(t)$ and $r_{ji}(t)$ relating states $S_i$ and $S_j$ define the
conditional probability densities of transitions at time t from $S_i$
to $S_j$ and from $S_j$ to $S_i$, respectively; cf. Figure 3.1.)

[Figure 3.1. General Structure of a Markov Model: transitions arrive
from other states and depart to other states.]

The avionics systems to be modeled by CARE III are to be
extremely reliable; only rare combinations of unlikely events
can be permitted to cause the system to fail. Consequently,
numerous parameters are needed to characterize each state and,
in particular, its vulnerability to subsequent faults. Specifically,
each state is defined not only by the number of faults
in each of its coupled stages, but by the status of each of
these faults as well. The status of a fault is defined by all
those parameters needed to determine the system's vulnerability
to subsequent faults (e.g., detected; undetected, benign,
intermittent fault of a given type; undetected, active, intermittent
fault of a given type; etc.). It should not be surprising that
under these circumstances, the number of states needed to
characterize a system can be extremely large. If a system
consists of n coupled stages, if the i-th stage can sustain as
many as $m_i$ faults and still be operational, and if the status
of each stage-i fault can be any one of $\ell_i$ possibilities, the
total number N of system states that have to be considered is

$$N = \prod_{i=1}^{n} \sum_{j=0}^{m_i} \binom{j + \ell_i - 1}{j}$$

This number can be large even for relatively small parameters
$\ell_i$, $m_i$, and n. (For example, when n = 4, and $\ell_i$ = 6, $m_i$ = 2
for all i, N = 614,656.)
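
The example count can be checked numerically. The sketch below assumes
that, for each stage, the states are enumerated by choosing a number j
of faults (from 0 through the stage maximum) and then an unordered
assignment of one of the possible statuses to each of the j faults.
This reading is an interpretation, adopted here because it reproduces
the value N = 614,656 quoted above; the function and argument names
are illustrative.

```python
from math import comb

def markov_state_count(max_faults, status_counts):
    """Number of system states when every fault carries one of several statuses.

    max_faults    -- maximum number of tolerated faults in each coupled stage
    status_counts -- number of possible statuses of a fault in each stage
    """
    total = 1
    for m_i, l_i in zip(max_faults, status_counts):
        # Multisets of size j drawn from l_i statuses, summed over j = 0..m_i.
        total *= sum(comb(j + l_i - 1, j) for j in range(m_i + 1))
    return total

print(markov_state_count([2] * 4, [6] * 4))  # 614656
```

Each stage in the example contributes 1 + 6 + 21 = 28 states, and
28**4 = 614,656.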

Mathematical methods for determining the probability 
that a system is in any one of its Markov states at any time t 
are well known and particularly efficient solution techniques 
are available when the state transition functions $r_{ij}(t)$ are
independent of t. With the Markov model just described, it
is possible (although undesirably restrictive) to treat these
functions as time-invariant, so these mathematical methods can,
in fact, be applied. Even so, these methods become
computationally infeasible when the number N of states becomes large,
even when advantage is taken of the fact that the number of
allowed state transitions is much less than the maximum possible
number, N(N-1). Since, as already noted, the number of states
needed to describe systems of interest here can easily exceed 
10 , another approach was clearly needed. (Nevertheless, a 
thorough investigation was made into methods for efficient 
computer manipulation of Markov model transition matrices.
This investigation was undertaken for two reasons: (1) to
provide an alternative should difficulties be encountered in
developing the preferred CARE III approach; (2) to develop
techniques that may be useful in implementing the CARE III
coverage model. The results of this investigation are summarized
in Volume II of this report.)

By far the most promising of the reliability modeling 
techniques examined for the class of fault-tolerant systems 
of concern here was one based on Kolmogorov's forward differential
equations; for convenience, it will be referred to as
the Kolmogorov Method. Several variations on this method
were postulated and examined in detail in order to determine
the most efficacious procedure for applying it to the problem
at hand. The variations considered are described in the following
paragraphs. Before proceeding, however, it may be
useful to outline the general approach.

As already noted, the major problem with the Markov
Method, as outlined, is the inordinately large number of states
needed to distinguish all the various fault conditions. As
also noted, these conditions can be specified in terms of two
sets of parameters: 1) the number of faults in each of the
coupled stages; 2) the status of each of these faults. The
essence of the Kolmogorov approach is in the separate treatment
of these two sets of parameters. That is, system states are
used to represent only the first set of parameters; the effect
of the second set of parameters is reflected implicitly in the
state transition functions.

The separate treatment of the two sets of parameters
needed to model fault occurrence and fault recovery has two
major advantages: 1) It drastically reduces the number of
states needed to represent the system (from the previously
defined number N to, in the same notation, $\prod_i (m_i + 1)$;
e.g., from N = 614,656 states in the previous example to 81).
2) It circumvents the serious
computational difficulty presented by a model that combines
in one homogeneous structure the relatively infrequent state 
transitions characterized by the first set of parameters 
(perhaps one fault/10 hours) and the much more frequent 
transitions due to fault status changes (e.g., detection 
rates of the order of seconds, intermittent fault transition 
rates of the order of minutes or less, error generation rates 
of the order of milliseconds) . 

The major disadvantages of this modeling approach are
also twofold: 1) The state transition functions are now
considerably more difficult to determine. They are in effect
conditioned only on time and on the number of previous failures 
of each type; the probability density of a transition under 
these conditions can be determined only by averaging over all 
possible values of the implicit parameters. 2) The state- 
transition functions are necessarily functions of time, thereby 
precluding from the outset the time-homogeneous Markov chain 
solution techniques mentioned previously. 

The first of these disadvantages is reflected in a more
complex coverage model than would otherwise be required. The
important point here, however, is that the combinatorial and
Markov techniques mentioned earlier can be applied at the
coverage model level as well as at the reliability model level.
Furthermore, the number of states needed to determine the
conditional transfer functions is vastly less than the number
of states in an undifferentiated Markov model of the entire
system. Thus, the coverage model computational effort, while
greater than it would otherwise have been, is still almost
negligible compared to that needed to determine the state
probabilities for the system level Markov model. In effect,
the model has been reduced from one having
$N = n_1 \times n_2 \times \cdots \times n_\ell$
states to one having $n_1 + n_2 + \cdots + n_\ell$ states, with $n_i$ denoting
the number of relevant states given that i faults have already
taken place. (The reduction is in fact more dramatic than this
since much of the computational effort needed to determine the
transition functions given i faults can also be used to determine
these functions given $j \neq i$ faults.)
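
As a rough illustration of this reduction, the short sketch below compares the two state counts. The per-level counts in `n` are purely hypothetical values chosen for the example; they are not taken from any CARE III model.

```python
import math

# Hypothetical numbers of relevant fault-status states n_i, one entry per
# fault level i (illustrative values only, not from the report).
n = [4, 7, 7, 7]

undifferentiated = math.prod(n)  # n_1 x n_2 x ... x n_l states
separated = sum(n)               # n_1 + n_2 + ... + n_l states

print(undifferentiated, separated)
```

Even for these small assumed counts the product form is larger than the sum form by well over an order of magnitude, and the gap widens rapidly as levels are added.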

The detailed development of the CARE III coverage model
will be undertaken in Phase 2 of this effort. As currently
envisioned, it will combine both combinatorial and Markov
techniques. The former will be used to determine the probability
that a given combination of faults can, under a
specific set of conditions (e.g., all faults simultaneously
active), cause the system to fail; the latter will be used to
determine the probability that the specified set of conditions
does indeed obtain at any given time. Some specific examples
of this coverage modeling approach, used during Phase 1 as part
of the reliability model test exercise, are described in
paragraph 3.3.

The second of the above-mentioned disadvantages to the 
modeling approach outlined here is largely overcome by basing 
the solution techniques on Kolmogorov's forward differential 
equations. The procedure for doing this is the subject of 
the remainder of this section.

3.1 THEORETICAL DEVELOPMENT


Let $P_{j|i}(t|\tau)$ denote the conditional probability that a
system is in state j at time t given that it was in state
i at time $\tau$. Similarly, let $P_{\ell|j,i}(t|\eta,\tau)$ denote the
conditional probability that a system is in state $\ell$ at time
t given that it was in state j at time $\eta$ and in state i at
time $\tau$. Then, clearly, for any $\tau < \eta < t$,

$$P_{\ell|i}(t|\tau) = \sum_{j} P_{j|i}(\eta|\tau)\, P_{\ell|j,i}(t|\eta,\tau) \qquad (1)$$

with the sum taken over all the (assumed finite number of)
possible intermediate states j. (If, for all $\tau < \eta < t$,
$P_{\ell|j,i}(t|\eta,\tau) = P_{\ell|j}(t|\eta)$, then equation (1) reduces to
the Chapman-Kolmogorov equation for continuous-time, discrete
state systems.)

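It follows from equation (1) that

$$P_{\ell|i}(t+\Delta t|\tau) = P_{\ell|i}(t|\tau)\, P_{\ell|\ell,i}(t+\Delta t|t,\tau) + \sum_{j \neq \ell} P_{j|i}(t|\tau)\, P_{\ell|j,i}(t+\Delta t|t,\tau) \qquad (2)$$

Let

$$\lambda_{\ell|i}(t|\tau) = \lim_{\Delta t \to 0} \frac{1 - P_{\ell|\ell,i}(t+\Delta t|t,\tau)}{\Delta t}$$

and

$$c_{j\ell|i}(t|\tau)\, \lambda_{j\ell|i}(t|\tau) = \lim_{\Delta t \to 0} \frac{P_{\ell|j,i}(t+\Delta t|t,\tau)}{\Delta t}$$

(The reason for this latter notation will become apparent
shortly.) Then, rearranging terms in equation (2), dividing
by $\Delta t$ and taking the limit as $\Delta t \to 0$ yields

$$\frac{\partial P_{\ell|i}(t|\tau)}{\partial t} = -P_{\ell|i}(t|\tau)\, \lambda_{\ell|i}(t|\tau) + \sum_{j \neq \ell} P_{j|i}(t|\tau)\, c_{j\ell|i}(t|\tau)\, \lambda_{j\ell|i}(t|\tau) \qquad (3)$$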

This set of equations is a form of the Kolmogorov
forward equations. It differs from the more conventional
form in that the transition parameters $c_{j\ell|i}(t|\tau)\, \lambda_{j\ell|i}(t|\tau)$
are also functions of the initial state i of the system at
time $\tau$. If the notation indicating the condition that the
system be in state i at time $\tau$ is suppressed, equation (3)
can be expressed in the more convenient form

$$\frac{dP_\ell(t)}{dt} = -P_\ell(t)\, \lambda_\ell(t) + \sum_{j \neq \ell} P_j(t)\, c_{j\ell}(t)\, \lambda_{j\ell}(t) \qquad (4)$$

It must be remembered in the ensuing discussion, however, 
that the transition parameters may also be functions of the 
initial conditions. 

Four recursive reliability modeling methods based on 
Kolmogorov's forward equation, equation (4), were investigated 
in an effort to find the most suitable application of this 
result to the class of problems of concern here. These four 
methods are described in the following paragraphs. 

3.1.1 DIFFERENCE EQUATION FOR RELIABILITY 

Let $P_\ell(t)$ denote the probability that the system is
operating at time t having undergone exactly $\ell$ failures.
(If it is necessary to distinguish between different types
of failures, $\ell$ will actually be a vector; e.g., $\ell = (i, j, k)$
indicating i failures of type 1, j of type 2 and k of type
3.) Let $\lambda_\ell(t)$ denote the rate at which failures occur given
that the system has sustained $\ell$ failures by time t. Let
$\lambda_{j\ell}(t)$ denote the rate of occurrence of failures that would,
if coverage were perfect, lead from state j to state $\ell$ (i.e.,
from the state characterized by j failures to that characterized
by $\ell$ failures).

Then

$$\sum_{\ell} \lambda_{j\ell}(t) = \lambda_j(t)$$

with the sum taken over all states $\ell$ which can be reached
in one transition from state j. Finally, let $c_{j\ell}(t)$ denote
the coverage probability associated with a failure which
would, in the event of perfect coverage, cause a transition
from state j to state $\ell$. (The coverage associated with a
failure occurring when the system is in state j is therefore

$$c_j(t) = \frac{1}{\lambda_j(t)} \sum_{\ell} c_{j\ell}(t)\, \lambda_{j\ell}(t)$$

with the range of summation and the term $\lambda_j(t)$ as previously
defined.)

With these definitions, equation (4), rewritten in
difference-equation form

$$P_\ell(t+\Delta t) = P_\ell(t)\,\bigl(1 - \lambda_\ell(t)\,\Delta t\bigr) + \sum_{j \neq \ell} P_j(t)\, c_{j\ell}(t)\, \lambda_{j\ell}(t)\, \Delta t \qquad (5)$$

defines a recursion, on both t and $\ell$, on the probabilities
$P_\ell(t)$. The probability that the system is successfully
operating at time t is then just

$$R(t) = \sum_{\ell \in L} P_\ell(t) \qquad (6)$$

with the summation taken over all allowable states $\ell$.

Actually, equation (5) defines a recursion on $\ell$ only if
the states can be suitably ordered. This is the case, for
example, if it is impossible to go from a state having $\|\ell\|$
failures (with $\|\ell\|$ indicating the number of failed units
represented by the vector $\ell$) to a state having fewer than
$\|\ell\|$; i.e., if failed units never "heal". This would
appear to eliminate transient failures from the model. This
is not the case, however, if the coverage coefficients make
the proper distinction between "leaky" and "non-leaky"
transients.
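
A minimal sketch of the recursion defined by equations (5) and (6) is given below for a deliberately simple case: a single stage of `n_units` identical units, each with a constant hazard rate `lam`, a constant coverage probability `coverage`, and a scalar failure count. These modeling choices, and all function and parameter names, are assumptions made for illustration; they are not the FTMP or SIFT models discussed later.

```python
def rm1_reliability(n_units, max_failures, lam, coverage, t_max, steps):
    """Difference-equation recursion of equation (5), summed as in equation (6).

    Assumes one stage of identical units, so that lambda_l = (n_units - l) * lam,
    and a constant coverage probability for every transition.
    """
    dt = t_max / steps
    # p[l] = P_l(t): system operating with exactly l failed units.
    p = [1.0] + [0.0] * max_failures
    for _ in range(steps):
        new_p = []
        for l in range(max_failures + 1):
            rate_out = (n_units - l) * lam            # lambda_l(t)
            value = p[l] * (1.0 - rate_out * dt)
            if l > 0:
                rate_in = (n_units - (l - 1)) * lam   # lambda_{l-1,l}(t)
                value += p[l - 1] * coverage * rate_in * dt
            new_p.append(value)
        p = new_p
    return sum(p)  # equation (6): sum over the allowable states


# Example: three units, one tolerated failure, 99% coverage, 10 hours.
reliability = rm1_reliability(3, 1, 1.0e-4, 0.99, 10.0, 1000)
```

Because units never "heal" in this sketch, each state is fed only by the state with one fewer failure, which is the ordering property the recursion on the failure count requires. With perfect coverage and every failure count allowed, the returned sum stays at 1 to within round-off, which is a convenient sanity check.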

3.1.2 DIFFERENCE EQUATION FOR UNRELIABILITY 

Let $P^*_\ell(t)$ be the probability that the system would be
operating in state $\ell$ at time t were coverage perfect, let
$Q_\ell(t) = P^*_\ell(t) - P_\ell(t)$ and let $\bar{c}_{j\ell}(t) = 1 - c_{j\ell}(t)$. Then
equation (5) can be rewritten:

$$Q_\ell(t+\Delta t) = Q_\ell(t)\,[1 - \lambda_\ell(t)\,\Delta t] + \sum_{j \neq \ell} \bigl[Q_j(t) + P_j(t)\, \bar{c}_{j\ell}(t)\bigr]\, \lambda_{j\ell}(t)\, \Delta t \qquad (7)$$

and the system unreliability becomes

$$1 - R(t) = \sum_{\ell \in L} Q_\ell(t) + \sum_{\ell \in \bar{L}} P^*_\ell(t) \qquad (8)$$

with L as previously defined and $L \cup \bar{L}$ the set of all
possible states.


An interesting variation on the approach suggested by this
formulation is obtained by treating all states representing
system failures as terminal rather than transient states.
This is equivalent to redefining $Q_\ell(t)$ as the probability
that the system has failed by time t and at the time of the
failure it contained exactly $\ell$ failed units. Since unit
failures occurring after the system has failed do not in
this case cause a state change, equation (7) now assumes the
simpler form

$$Q_\ell(t+\Delta t) = Q_\ell(t) + \sum_{j \neq \ell} P_j(t)\, \bar{c}_{j\ell}(t)\, \lambda_{j\ell}(t)\, \Delta t$$

Now, however, the two probabilities

$$Q(t) = \sum_{\ell \in L} Q_\ell(t) \quad \text{and} \quad P^*(t) = \sum_{\ell \in \bar{L}} P^*_\ell(t)$$

no longer represent disjoint events and equation (8) becomes
an inequality rather than an equality. That is, the probability
Q(t) here is a measure of the event (A) that the
system has failed by time t due to a coverage failure; P*(t)
measures the event (B) that $\ell \in \bar{L}$ units have failed by time t.
Thus, $1 - R(t) = P(A \cup B) = P(A) + P(B) - P(A|B)P(B) \le P(A) + P(B)$.
(It can be argued that $P(A|B) > P(A)$; that is, the conditional
probability of a coverage failure given that the total number
of failures exceeds some minimum must be greater than the
unconditional probability of a coverage failure. Thus,
$1 - R(t) = P(A \cup B) \le Q(t) + P^*(t) - Q(t)P^*(t)$.) Since clearly
$1 - R(t) \ge \max(Q(t), P^*(t))$, the fact that the events A and B are not
mutually exclusive is of potential concern only when Q(t) and
P*(t) are both small and of the same order of magnitude. Even
in this case, the unreliability would be overestimated by at
most a factor of two. The reduction in computational complexity
potentially achievable by treating each failed state
as a terminal state may well justify this small reduction in
accuracy; this possibility will be explored during Phase 2
of this study.
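
The factor-of-two statement follows directly from the two bounds already given. Writing $U = 1 - R(t)$ for the true unreliability, and using only $U \ge \max(Q, P^*)$ and $U \le Q + P^*$ (with the argument t suppressed), a short worked form of the step is:

```latex
\max(Q, P^*) \;\le\; U \;\le\; Q + P^* \;\le\; 2\,\max(Q, P^*)
\quad\Longrightarrow\quad
1 \;\le\; \frac{Q + P^*}{U} \;\le\; 2 .
```

The upper limit of 2 can be approached only when Q and P* are nearly equal and the events A and B nearly coincide; when one of the two terms dominates, the ratio is close to 1.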
This formulation offers a significant potential
advantage when, as is the situation of concern here, $R(t) \approx 1$
for all t of interest. In this case,

$$\sum_{\ell \in L} P_\ell(t) \approx 1$$

and the sum of the round-off errors obtained in calculating
the individual $P_\ell(t)$ terms may well be of the order of the
quantity of major interest; viz., the unreliability

$$1 - \sum_{\ell \in L} P_\ell(t).$$

Under these same conditions, however, the terms $Q_\ell(t)$ must
be small for all $\ell \in L$ and the terms $P^*_\ell(t)$ must be small for
all $\ell \in \bar{L}$. If the round-off error associated with each of
these terms can be kept small relative to the terms themselves,
it follows that the cumulative round-off error will
be small compared to their sum.
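
The sketch below applies equations (7) and (8) to the same kind of simplified single-stage system (identical units, constant hazard rate, constant coverage) to show why the small terms can be accumulated directly. The model, the function name, and the choice to pass the complement `c_bar` = 1 - c as its own argument are assumptions made for this illustration only.

```python
def rm2_unreliability(n_units, max_failures, lam, c_bar, t_max, steps):
    """Recursion of equation (7), summed as in equation (8), for one stage.

    c_bar is the probability that a failure is NOT covered (1 - coverage);
    passing it directly avoids forming 1 - c when c is very close to 1.
    """
    dt = t_max / steps
    m = max_failures
    rate = [(n_units - l) * lam for l in range(m + 1)]  # lambda_l, constant here
    p_star = [1.0] + [0.0] * m   # P*_l(t) for the allowable states l in L
    q = [0.0] * (m + 1)          # Q_l(t) = P*_l(t) - P_l(t) for l in L
    exhausted = 0.0              # sum of P*_l(t) over the states outside L
    for _ in range(steps):
        new_p_star = [p_star[l] * (1.0 - rate[l] * dt) for l in range(m + 1)]
        new_q = [q[l] * (1.0 - rate[l] * dt) for l in range(m + 1)]
        for l in range(1, m + 1):
            inflow = rate[l - 1] * dt             # lambda_{l-1,l} * dt
            p_prev = p_star[l - 1] - q[l - 1]     # P_{l-1}(t)
            new_p_star[l] += p_star[l - 1] * inflow
            new_q[l] += (q[l - 1] + p_prev * c_bar) * inflow
        # Under perfect coverage, everything leaving state m lands outside L.
        exhausted += p_star[m] * rate[m] * dt
        p_star, q = new_p_star, new_q
    return sum(q) + exhausted


# Example with an unreliability far below double-precision resolution near 1.
u = rm2_unreliability(3, 1, 1.0e-9, 1.0e-6, 1.0, 100)
```

In this example the unreliability is of the order of 3e-15, which is only a few tens of units in the last place of a double-precision number near 1. Forming it as one minus a sum of reliabilities would therefore leave few, if any, meaningful digits, whereas the accumulated small terms retain their full relative precision.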

3.1.3 INTEGRAL EQUATION FOR RELIABILITY 

Equation (4) is a linear, first-order differential
equation. This equation can be easily solved to yield:

$$P_\ell(t) = P_\ell(0)\, e^{-\int_0^t \lambda_\ell(\eta)\, d\eta} + \int_0^t \sum_{j \neq \ell} P_j(\tau)\, c_{j\ell}(\tau)\, \lambda_{j\ell}(\tau)\, e^{-\int_\tau^t \lambda_\ell(\eta)\, d\eta}\, d\tau \qquad (9)$$

This also can be used to define a recursion on $\ell$ and t.
If the integrals in equation (9) are replaced by their
first-order approximations:

$$\int_0^{t+\Delta t} f(x)\, dx \approx \int_0^{t} f(\tau)\, d\tau + f(t)\, \Delta t$$

and if the exponentials are replaced by the first two terms
in their power-series expansions:

$$e^{-\epsilon(t)} \approx 1 - \epsilon(t)$$

equation (9) is identical to equation (5). If more
sophisticated approximations are used, however, it might
well be possible to achieve accuracy comparable to that
attainable with the equation (5) difference equations but
without the need to use such small step sizes $\Delta t$. This
possibility was investigated using Simpson's rule integration
for the integrals in equation (9) and using an existing
exponential evaluation subroutine. The results of the two
approaches are compared in Section 3.2.
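
To make the step-size argument concrete, the sketch below compares the two kinds of step on a scalar stand-in for equation (4), dP/dt = -lam * P + g(t), with constant `lam` and a known exact solution. The scalar problem, the one-step arrangement of the integral form, and all names are assumptions made for illustration; the sketch is not the CARE III implementation.

```python
import math


def euler_step(p, lam, g, t, dt):
    """First-order difference step, the scalar analogue of equation (5)."""
    return p * (1.0 - lam * dt) + g(t) * dt


def integral_step(p, lam, g, t, dt):
    """One-step form of equation (9): library exponential for the decay term
    and Simpson's rule for the inflow integral over [t, t + dt]."""
    def integrand(tau):
        return g(tau) * math.exp(-lam * (t + dt - tau))

    simpson = dt / 6.0 * (
        integrand(t) + 4.0 * integrand(t + 0.5 * dt) + integrand(t + dt)
    )
    return p * math.exp(-lam * dt) + simpson


def integrate(step, lam, g, t_max, steps):
    """Advance from P(0) = 0 to t_max using the supplied step function."""
    dt = t_max / steps
    p = 0.0
    for k in range(steps):
        p = step(p, lam, g, k * dt, dt)
    return p


lam, g0, t_max = 1.0, 1.0e-3, 5.0
exact = g0 / lam * (1.0 - math.exp(-lam * t_max))


def constant_inflow(t):
    return g0


# Only 10 steps, so lam * dt = 0.5 is far from small.
err_euler = abs(integrate(euler_step, lam, constant_inflow, t_max, 10) - exact) / exact
err_integral = abs(integrate(integral_step, lam, constant_inflow, t_max, 10) - exact) / exact
```

With a step this coarse the difference-equation form is in error by a fraction of a percent, while the integral form with Simpson's rule is more accurate by roughly two orders of magnitude, which is the kind of trade the text describes.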


3.1.4 INTEGRAL EQUATION ON UNRELIABILITY 


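If the substitutions described in paragraph 3.1.2 are
made in equation (9), the resulting expression assumes the
form:

$$Q_\ell(t) = \int_0^t \sum_{j \neq \ell} \bigl[Q_j(\tau) + P_j(\tau)\, \bar{c}_{j\ell}(\tau)\bigr]\, \lambda_{j\ell}(\tau)\, e^{-\int_\tau^t \lambda_\ell(\eta)\, d\eta}\, d\tau \qquad (10)$$
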

This formulation has the same potential advantage over that 
represented by equation (9) as the equation (7) approach 
has over the equation (5) approach. 

3.2 EVALUATION OF THE KOLMOGOROV RECURSION METHODS 

It quickly became apparent, after only a few trial 
program runs, that the recursions on unreliability were 
decidedly superior to those based on reliability for the 
situations of interest here. Although the reliability 
recursions did yield acceptable results, considerably better 
results could be obtained with comparable program execution 
time (larger step sizes) using the unreliability recursions. 
Consequently, the competition was quickly reduced to one 
between the method described in paragraph 3.1.2 and that
described in paragraph 3.1.4. 

The only approximations in the recursions developed in
Section 3.1 are those introduced in approximating a differential
equation by a difference equation or by approximating an
integral by a discrete summation. The modeling task is
considerably simplified, however, if one other approximation
is made in these formulations. This approximation involves
the determination of the coverage coefficients $c_{j\ell}(t)$.

In the examples to be considered here, the coverage
coefficients are the only parameters in the reliability
model recursions that are influenced by the implicit condition
that the system was in state $\ell = 0$ at time t = 0.
These terms are functions of, among other things, the probability
that any of a certain subset of failures are still latent
at the time of occurrence of the failure in question. Since
$\|\ell\|$ failures took place in time t, it is clear that the
likelihood of a latent failure at time t is a generally
increasing function of the ratio $\|\ell\|/t$. If no other
conditions were imposed, it would be relatively easy to
determine the probability that y latent failures are present
at time t given that the system was in state j at time $t^-$.
There is another condition, however: the system was still
operating at time $t^-$. This condition reduces the likelihood
of certain failure sequences and hence perturbs the
stochastic process characterizing failure events relative
to the case when this condition does not apply. For example,
the fact that the system is still operating reduces the
probability that two failures occurred within a short interval
of each other if a system failure would have resulted were
one of these failures latent when the other took place.

It is apparent (or at least it will become apparent once
specific examples are considered) that the effect of this
perturbation in the stochastic failure process must be
negligible except, possibly, for very small values of t,
in which case all failure events are extremely unlikely.
Accordingly, this effect is ignored in the following formulations.
The resulting distribution of latent faults is
precisely that which would be found were no distinction made
as to whether the system was operational or not; i.e., if no
distinction were made between the state represented by the
probability $P_j(t)$ and that represented by $Q_j(t)$. Since the
probability of being in either of these two states is $P^*_j(t)$,
the probability of a system failure at time t can therefore
be overbounded by replacing $P_j(t)$ in equation (7) or (10) by
$P^*_j(t)$ and ignoring the condition on $c_{j\ell}(t)$ just discussed.
Further, since ignoring this condition on the failure process
presumably results in a more favorable distribution of fault
events so far as coverage at time t is concerned^, leaving $P_j(t)$
in equations (7) and (10) should result in a lower bound
on the probability of system failure. In fact, as several
computer runs demonstrated, the calculated system reliability
is identical to six or seven decimal places regardless of
whether $P_j(t)$ or $P^*_j(t)$ is used. This of course supports the
contention that the ignored condition is in fact not significant.

The following paragraphs discuss the results obtained in
applying the methods discussed in Section 3.1 (primarily those
of paragraphs 3.1.2 and 3.1.4) to the FTMP and SIFT computers.
It should be emphasized here that the purpose of this
exercise was not to model the computers themselves, but
rather to incorporate the same general assumptions used in
the previously developed models for these computers and to
compare the results thus obtained with the results obtained
using these earlier models.

^To illustrate this, consider the following simplified situation.
Suppose failures can occur only at discrete instants
of time (t = 0, 1, 2, ...), that no two failures can occur
simultaneously, and that each failure is latent for exactly
one unit of time. If a second failure occurs during the
latency of a previous failure (i.e., exactly one time unit
later), the system fails. Now consider $\bar{c}_{2,3}(t = 8)$. If the
condition that the system is still operating at time t = 7 is
ignored, there are exactly $\binom{8}{2} = 28$ ways in which 2 failures
could have occurred in the 8 time instants t = 0, 1, ..., 7;
exactly 7 of these failure sequences result in a latent
failure at t = 8. The probability $\bar{c}_{2,3}(8)$ of a coverage
failure is therefore 7/28 = 0.25. If the condition in
question is not ignored, however, the number of possible
sequences is reduced to 21, 6 of which result in a latent
failure at t = 8. The probability of a coverage failure is
thus increased to 6/21 = 0.286. Note that even in this
extreme case, with t small (only 8 times the latency period),
$\|\ell\|$ large (the third failure occurs after only 8 latency
periods), and with all latent failures causing a system
failure in the event of any other failure, the effect of the
condition in question is to increase $\bar{c}$ by 14%. Under more
realistic conditions, the effect on the coverage coefficients
should be entirely insignificant.
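
The counts quoted in the footnote's simplified example (28 and 7 without the operating condition, 21 and 6 with it) can be checked by direct enumeration. The sketch below is only a verification aid for that example; the helper names are hypothetical.

```python
from itertools import combinations

instants = range(8)                      # failure instants t = 0, 1, ..., 7
pairs = list(combinations(instants, 2))  # two failures, never simultaneous


def latent_at_8(pair):
    # A failure is latent for exactly one time unit, so only a failure
    # occurring at t = 7 is still latent at t = 8.
    return 7 in pair


def survived(pair):
    # The system fails if the second failure arrives exactly one time
    # unit after the first, i.e. during the first failure's latency.
    return pair[1] - pair[0] != 1


unconditional = sum(latent_at_8(p) for p in pairs) / len(pairs)

operating = [p for p in pairs if survived(p)]
conditional = sum(latent_at_8(p) for p in operating) / len(operating)

increase_percent = (conditional / unconditional - 1.0) * 100.0
```

Here `unconditional` is 7/28 and `conditional` is 6/21, so the relative increase is just over 14%, in agreement with the footnote.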

The purpose of this effort was to judge the efficacy of 
the various reliability models under consideration before 
proceeding with their more detailed development. In order to 
accomplish this, it was necessary to derive analytic expres- 
sions for the coverage probabilities needed in the reliability 
model. This task was subsequently eliminated, so far as the 
user is concerned, by restructuring the reliability model. 

This restructuring, and the application of the restructured 
model to both FTMP and SIFT are described in paragraph 3.3. 

The following paragraphs, therefore, concentrate on the 
results of this reliability model comparison rather than on 
the derivation of expressions for $c_{j\ell}(t)$.

3.2.1 APPLICATION TO FTMP - PERMANENT FAILURE CASE 

The four recursions discussed in paragraphs 3.1.1, 3.1.2,
3.1.3, and 3.1.4 (henceforth to be referred to as reliability
models RM1, RM2, RM3, and RM4, respectively) were first used
to model the FTMP with all failures treated as permanent.

The first recursions to be programmed for this application
were RM3 and RM4. For comparative purposes, an exact
solution was determined analytically for the probability
$P_{3,0,0}(t)$ (i.e., the probability that the system is still
operating at time t after having sustained exactly three
processor failures, no memory failures and no bus failures).*
This exact solution was also programmed and the result used
to evaluate the accuracy of the two recursive methods. The
values obtained for t = 30 seconds, for example, when the


*The exact solution can be expressed as follows:


P 3,0,0 (t) 


n 


3PJ (l-e' X P t ) 3 -2n (n -2)A(X ,6 ,t) 


•2n p <n p -3 ) B <X p , « p , t) -4n p C < X p , S p , t) 


•2n (n —3) D ( X ,6 ,t) 
P P P P 


- (n -3) . t 
e p Xp 


e “ n m^m t 


A (X , 6 , t) = 


Xe 


-At 


- Ae 


3 - (S+2A) t 


Ae 


-3 At 


3 ( 6+2X) 2<6 + X) ({ 2_ X 2 )(S+2X) 6(4-X) 


B (X , 6 , t) = 


,3 - (6+X) t 

A G 


. -2Xt . -3Xt 
X e ^ X e 


6(6+X) (6 2 -X 2 )(6-2X) 2(6 " A) 3(6 " 2X) 


C ( X , 6 , t) = 


X6 


X 2 6e- (6+X)t 


X “2Xt L 
Xe + 


6 (6+X) ( 6+2X) ({ 2_ X 2 )(5 _ 2X) 2(«-X) 


,2 - (6+2X) t ,, -3Xt 
X e + X6e 


( 6 — X ) ( 6+2X) 3 ( 6 — X ) ( <5-2X) 


D ( X , 6 , t) = 


2 - (6+X) t 


X e 


. 2 - (6+2X) t 
A G 


xV 33t 


3 (6 + X) (6+2X) (6+X) (6-2X) (6-X)(6+2X) 3(6-X)(6-2X) 


with $n_p$, $n_m$, $n_b$ denoting the initial number of processors,
memories, and buses, $\lambda_p$, $\lambda_m$, $\lambda_b$ their respective hazard rates,
and $\delta_p$ the detection rate for processor faults.

initial configuration consisted of 15 processors, 9 memories
and 5 buses, were:

    RM3:    $P_\ell(t) = .26330 \times 10^{-15}$
    RM4:    $P_\ell(t) = .25575 \times 10^{-15}$
    Exact:  $P_\ell(t) = .25579 \times 10^{-15}$

Similarly, with a 15 processor, 8 memory, 4 bus initial
configuration, the results for t = 300 hours were:

    RM3:    $P_\ell(t) = .64394340 \times 10^{-2}$
    RM4:    $P_\ell(t) = .64384685 \times 10^{-2}$
    Exact:  $P_\ell(t) = .64384684 \times 10^{-2}$



These agreements, especially between RM4 and the exact 
solution are surprisingly good, particularly when it is 
recognized that the "exact" solution is also subject to 
round-off error. 

The results of the comparison between RM3 and RM4
strongly favored the latter model. Since RM2 presumably has
the same advantage over RM1 that RM4 has over RM3, the
competition, as previously noted, was quickly narrowed to
RM2 and RM4.

Table 3.1 summarizes results obtained using RM2 and
RM4 with $\Delta t = t_{max}/50$, and RM2 with $\Delta t = t_{max}/100$. (A more
complete listing of the results summarized here and in the
following examples can be found in an appendix to this
report.) As can be seen, RM2 is slightly faster than RM4
when $\Delta t$ is the same in the two cases. The accuracy attainable
with RM4 seems to be somewhat better than that attainable
with RM2 even when the latter's step size is half (and
its running time nearly double) that of the former. Note,
in particular, that halving the step size in the RM2 recursion
always brings the results obtained more nearly in line with
those obtained using RM4. Note, too, the excellent agreement
between RM4 runs having very different values of $t_{max}$.
Specifically, the t = 600 ms. result obtained when
$t_{max}$ = 30 sec. agrees quite well with that obtained when
$t_{max}$ = 800 ms. Yet in the first instance, t = 600 ms. is
the first point evaluated; in the second case, it is the
37.5th point (obtained by linear interpolation between the
37th and 38th points). This close agreement clearly is
not obtained with RM2, even when $\Delta t$ is halved.

TABLE 3.1

COMPARISON OF THREE NUMERICAL EVALUATION TECHNIQUES

Estimated Failure Probabilities and Running Times
vs. Numerical Evaluation Technique

Model: FTMP

Time        Elapsed Time    Integral            Difference-Eq.      Difference-Eq.      Vol. 2 Reference
Interval    From Start      (50 Steps)          (50 Steps)          (100 Steps)         Table A2-

1000 HRS.   20 HRS.         .9115504128 E-08    .4759138922 E-08    .9013134766 E-08    1, 4, 7
1000 HRS.   1000 HRS.       .2693321948 E-01    .2693321885 E-01    .2693322063 E-01    1, 4, 7
1000 HRS.   RUNNING TIME    41.078 secs.        40.141 secs.        76.282 secs.

30 SECS.    30 SECS.        .3410688041 E-11    .3372096536 E-11    .3391798683 E-11    2, 5, 8
30 SECS.    1200 ms.        .3189783213 E-13    .1656918144 E-13    .2439316532 E-13    2, 5, 8
30 SECS.    600 ms.         .8284643018 E-14    -.8526508200 E-22   .4409338212 E-14    2, 5, 8
30 SECS.    RUNNING TIME    32.866              31.094              56.244

800 ms.     600 ms.         .8642938477 E-14    .8421766317 E-14    .8531124275 E-14    3, 6, 9
            800 ms.         .1495029013 E-13    .1466706658 E-13    .1480876007 E-13    3, 6, 9
            16 ms.          .6686304258 E-17    .8800221654 E-33    .3349778148 E-17    3, 6, 9
            RUNNING TIME    32.991              30.998              56.332

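
The statement that halving the RM2 step size brings its results closer to those of RM4 can be checked directly from the Table 3.1 entries. The sketch below does this for four of the tabulated rows; the row labels and helper names are introduced here for illustration, and the numbers are copied from the table.

```python
# (RM4 integral, RM2 with 50 steps, RM2 with 100 steps), from Table 3.1.
rows = {
    "t = 30 s,    t_max = 30 s":   (0.3410688041e-11, 0.3372096536e-11, 0.3391798683e-11),
    "t = 1200 ms, t_max = 30 s":   (0.3189783213e-13, 0.1656918144e-13, 0.2439316532e-13),
    "t = 600 ms,  t_max = 800 ms": (0.8642938477e-14, 0.8421766317e-14, 0.8531124275e-14),
    "t = 800 ms,  t_max = 800 ms": (0.1495029013e-13, 0.1466706658e-13, 0.1480876007e-13),
}


def relative_deviation(value, reference):
    """Deviation of an RM2 result from the corresponding RM4 result."""
    return abs(value - reference) / reference


ratios = {}
for label, (rm4, rm2_50, rm2_100) in rows.items():
    dev_50 = relative_deviation(rm2_50, rm4)
    dev_100 = relative_deviation(rm2_100, rm4)
    ratios[label] = dev_50 / dev_100
    print(f"{label}: 50 steps {dev_50:.3%}, 100 steps {dev_100:.3%}")

# Index of t = 600 ms on the 50-step grid of the t_max = 800 ms run.
grid_index = 600.0 / (800.0 / 50.0)
```

For each of these rows the deviation from RM4 falls by close to a factor of two when the step size is halved, which is the behavior expected of a first-order difference scheme. The last line reproduces the 37.5th-point position mentioned in the text.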

As a result of these comparisons, it was concluded that 
RM4 is clearly the best of the reliability modeling 
approaches examined, and that it appears to be entirely 
satisfactory, in terms of accuracy, stability and computer 
running time, for the applications of interest. 

Four computer runs were made using RM4 for purposes of
comparison with results obtained by Draper in their model
of the FTMP. The results of these runs, with $t_{max}$ = 800 ms.
and 30 sec., are superimposed over results obtained by Draper
in Figures 3.2 and 3.3, respectively. Figure 3.4 compares
Draper's results with those obtained from two RM4 runs, one
with $t_{max}$ = 10 hrs. and one with $t_{max}$ = 1000 hrs.

The RM4 results on the whole compare well with Draper's
results. The reason for the discrepancy in Figure 3.2 is not
clear. It is conceivable that the discrepancy is due to a
difference in the assumed conditions under which certain combinations
of latent faults can cause a system failure. The fact
that Draper's model treats three or more concurrent undetected
failures as a system failure does not, however, appear to be
sufficiently restrictive to explain the difference. In any
case, the two results agree to within about 20%.

Figure 3.4 RESULTS VS. DRAPER'S EXTRAPOLATED AND COMBINATORIAL MODEL RESULTS
(cf. Vol. 2, Tables A2-10, 11)
[Plot: probability of system failure versus time, 10 to 1000 hours]


The agreement between the results obtained with RM4
and those obtained with Draper's 11-state Markov model
(Figure 3.3) is remarkably good. The agreement between
the two sets of results in Figure 3.4 is also quite good,
the difference possibly attributable to the difficulty in
plotting on a gridless graph.+

3.2.2 APPLICATION TO SIFT 

Four different cases were investigated using RM4 to
model SIFT. The first three cases (Cases 1a, 1b and 1c) all
modeled the computer in a permanent fault environment;
variations were introduced in order to gauge the sensitivity
of the model to what appeared to be relatively minor
perturbations. Case 1a was postulated to reflect those
conditions imposed in SRI's reliability model of SIFT. In
that model, buses are not permitted to fail while a processor
failure is still latent and processors cannot fail while a
bus failure is latent. In Case 1b, this restriction is
removed, but neither of these two events (bus failure during
processor latency or vice versa) causes a system
failure. This restriction is also removed in Case 1c, but
here either event does cause a system failure.

+For the record, it should be mentioned that the analytical
expression for coverage used for Table 3.1 was not identical
to that used for Figures 3.2, 3.3 and 3.4. In the former
case, the recovery rate associated with a processor or
memory was equated to the weighted average of the unit's
recovery rate and those of its associated BGU's. In
the latter cases, the slightly more cumbersome weighted
average of the corresponding recovery time distributions
was used. The difference in the results obtained in the
two cases was small and in no way affects the conclusions
gleaned from Table 3.1. The change was made before the
results plotted in Figures 3.2, 3.3 and 3.4 were obtained
since the latter recovery model more accurately represents
that postulated by Draper.

The fourth SIFT case (Case 2) involved a coverage model
similar to that used in Case 1b, but the fault environment
was changed to reflect SRI's transient fault model.

The results of these four investigations are summarized
in Table 3.2, as are the corresponding results obtained by SRI.
As can be seen, the results obtained using RM4 agree remarkably
well with those obtained by SRI. The fact that the Case 1a
and Case 1b results are nearly identical demonstrates that
the restriction imposed by SRI in their model is indeed
benign. This would be only slightly less true even if the
recovery from one type of failure were adversely affected by
a latent failure in a unit of the other type (Case 1c).

3.2.3 APPLICATION TO FTMP - INTERMITTENT FAULTS 

The CARE III reliability model was used to estimate the 
reliability of the FTMP in the presence of intermittent 
faults. The intermittent fault model used was that defined 
by Draper. That is, when a fault first occurs, it is in a
"bad" state, i.e., a state in which its effects are manifest.
It then switches between bad states and "good" states (in
which the fault is totally benign) at the constant rates $\beta$
(good-to-bad) and $\alpha$ (bad-to-good). A fault can be detected
only when it is in a bad state; the fault detection rate is
then a constant $\delta$ (which may be different for the different
module types).
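
The good/bad switching just described is a two-state continuous-time Markov process, and its behavior can be sketched in a few lines. The sketch below deliberately ignores detection (the rate $\delta$) and looks only at the switching itself; the function names are hypothetical, and the assignment of `beta` to the good-to-bad rate and `alpha` to the bad-to-good rate is an assumption consistent with the later remark that faults spend most of their time in the good state when $\beta$ is small and $\alpha$ is large.

```python
import math


def bad_state_fraction(alpha, beta):
    """Long-run fraction of time an undetected intermittent fault is 'bad'.

    alpha: bad-to-good switching rate; beta: good-to-bad switching rate.
    """
    return beta / (alpha + beta)


def bad_state_probability(alpha, beta, t):
    """P(fault is bad at time t), given that it starts bad at t = 0."""
    steady = bad_state_fraction(alpha, beta)
    return steady + (1.0 - steady) * math.exp(-(alpha + beta) * t)


# Illustrative rates only: a fault that is rarely manifest.
fraction = bad_state_fraction(10.0, 0.1)
```

With these illustrative rates the fault is manifest less than 1% of the time, so it can be detected only during a small fraction of its life, which is why a noticeable delay can separate the occurrence of an intermittent fault from its eventual effect.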
Table 3.2 SIFT MODELING RESULTS

(cf. Vol. 2, Tables A2-54 Through 65)

n_p   n_b   τ (SECS)   Trans.   Exp.   Case 1a       Case 1b       Case 1c       Case 2        SRI

10    5     10         No       -8     2.486301333   2.486176900   2.762068196                 2.50
9     4     10         No       -8     1.988342066   1.988242157   2.186894736                 2.00
8     3     10         No       -8     4.540032421   4.540032421   4.675449311                 4.56
10    5     0.1        Yes      -10                                              2.511510104   2.55
9     4     0.1        Yes      -10                                              2.061165614   2.10
8     3     0.1        Yes      -8                                               3.641260227   3.65


Parameters ; 


A o P ‘ 10 4 / hour 
A ob “ 10" S /hour 

f p - °-’ 


Ptr ‘ 


0.9 





















































The results obtained with the RM4 model are listed in
Table 3.3 along with the results obtained by Draper using
their Markov model. (To enable comparison, the parameters
used in the RM4 model for $\alpha$, $\beta$, $\delta$, $\lambda$ and t were precisely
those used by Draper.) The column labeled CARE III, Form 1,
shows the RM4 reliability predictions when no restrictions are
placed on the number of faults that can be simultaneously
present in the system. As can be seen, the reliabilities
predicted by RM4 are generally very close to those predicted
by Draper, the difference between the two predictions, however,
increasing as $\beta$ decreases. It was conjectured that these
differences were due to two basic differences in the CARE III
and Draper models. First, the Draper model did not allow more
than two faults to be present at the same time, even if some
of these faults were in the "good" state. Any such situation
was treated as a system failure. The RM4 model places no
restriction on the number of coexisting faults so long as
these faults are not by themselves catastrophic (e.g., simultaneous
"bad" faults in two processors in the same triad).

The second difference is due to the fact that the RM4 model
treats as a system failure at time t any combination of faults,
first appearing at time t, that eventually cause a system
failure even though the actual failure may occur at some time
t' > t. When $\beta$ is small and $\alpha$ large, faults spend most of
their time in the good state. Thus, there can be a significant
delay between the time a fault occurs and the time that it,
in combination with some other intermittent fault, produces an
actual failure. Since the RM4 model treats a system as being
in a failed state if it contains a combination of faults that
will eventually prove fatal, it is somewhat pessimistic
relative to a model in which such failures are not counted until
they actually occur.

The first of these differences is thus due to a restriction
on the Draper model, the second to a restriction on the
CARE III model. In order to overcome this latter restriction,
a modification was made in the integrand used in the Form 1
version of RM4 described in paragraph 3.1.4. This modified
version of RM4, called Form 2 and discussed in detail in
paragraph 3.3, does take into account the delay between the
occurrence of a fault and the resulting system failure. The
results obtained with this model are also listed in Table
3.3. As can be seen, the differences between the Form 1 and
Form 2 reliability estimates can indeed be significant when
$\beta \ll \alpha$.

Finally, in order to determine the significance of the
Draper model restriction, the same restriction (more than two
concurrent faults treated as a system failure) was placed on
the Form 2 version of RM4. The results obtained with this
restricted model (Form 2R) are tabulated in the third column
of Table 3.3. A comparison of these results with those
obtained by Draper (fourth column in Table 3.3) provides strong
support for the conjecture concerning the differences between
the Form 1 model and Draper's model.

It is believed that in most realistic situations, the
difference between the reliabilities predicted by the Form 1
and Form 2 models will be insignificant. It is not possible,
at this point, to conclude that this difference will be
insignificant in all cases of interest, however. Accordingly,
CARE III will implement both models, thereby allowing the
user to decide whether or not the more accurate reliability
prediction afforded by Form 2 justifies its increased running
time. (Form 2, when applied to FTMP, requires about three
times as much CPU time as does Form 1.)

Table 3.3

FTMP INTERMITTENT FAULT MODEL RESULTS

(cf. Vol. 2, Tables A2-18 Through 53)

                       Failure Probability (x 10^-8)

                   CARE III   CARE III   CARE III   Draper
 alpha    beta     Form 1     Form 2     Form 2R    Model

    10       1     1.1181     1.1161     1.1218     1.124
    10      10     1.2049     1.2041     1.2046     1.207
    10     100     1.1720     1.1718     1.1720     1.174
    10    1000     1.1274     1.1274     1.1275     1.129

   100       1     1.0925     1.0054     1.2058     1.2073
   100      10     1.9392     1.9072     1.9219     1.924
   100     100     1.6614     1.6585     1.6591     1.661
   100    1000     1.2182     1.2181     1.2183     1.220

  1000       1     0.9749     0.4239     1.4593     1.46
  1000      10     5.5057     3.7975     4.2295     4.22
  1000     100     6.2531     6.1513     6.1668     6.17
  1000    1000     2.1208     2.1198     2.1203     2.12

3.3 RELIABILITY MODEL STRUCTURE

Preliminary evaluation of the various reliability modeling
techniques under consideration was accomplished by defining
analytically the coverage functions needed for the test cases
described in the previous paragraphs. This task can be
arduous, however, and severely restricts the coverage model that
can be accommodated. The reliability model was therefore
restructured, both to increase its generality and to enable it
to use coverage parameters generated by a coverage model of
the sort implemented in CARE II. The new structure distinguishes
among inputs defining the system structure, inputs specifying
the underlying fault models, and coverage-model-generated inputs
characterizing the system's response to various categories of
faults. This structure is described in detail in the following
paragraphs.

3.3.1 SUBSYSTEM CHARACTERIZATION 

The reliability model to be described here is designed
to model the reliability of a subsystem consisting of some
arbitrary number of stages. The system reliability is then
determined by taking sums of the products of the reliabilities
of appropriate sets of subsystems multiplied by the probability
that no category 3 faults have occurred (cf. section 2). This
last procedure, while relatively straightforward, has not yet
been implemented and hence will not be discussed here.
(Combining subsystem reliabilities to determine the system
reliability clearly requires knowledge of the various successful
system configurations as interpreted by CAREIN. Accordingly,
implementation of this operation has been deferred until after
CAREIN has been more fully defined.) The discussion here
concerns the task of modeling the reliability of arbitrary
subsystem configurations.

Each stage in a subsystem consists of some number of 
identical modules or units; since the subsystem is fault- 
tolerant, it can presumably continue to operate successfully 
even after some of these units have failed. The probability 
that the subsystem recovers from a fault (i.e., its coverage 
for that fault), however, may depend upon many factors, 
including both the number of detected faults and the number of 
undetected faults in other modules in the same subsystem. 

(If the coverage associated with a fault in one stage is a 
function of the number of faults in some other stage, the two 
stages are said to be coupled.) 

For notational convenience, each stage will be indexed by
a Latin letter. Stage $x$, for every $x$, is subject to faults,
each of which belongs to some category $x_i$, $i = 1, 2, \ldots$.
The subsystem state is represented by a vector
$\underline{L} = (\ell_{x_1}, \ell_{x_2}, \ldots, \ell_{y_1}, \ell_{y_2}, \ldots)$,
$\ell_{x_i}$ indicating the number of stage $x$ units that have
experienced a category $x_i$ fault, etc., with each stage and
each fault category thus represented. The parameter $\ell_x$
represents the total number of faulty stage $x$ units,
$\underline{\ell} = (\ldots, \ell_x, \ldots)$ is a vector whose components
indicate the number of faulty units of each type, and

$$\ell = \sum_x \ell_x$$

the total number of faulty units. Similarly, the vector
$\underline{\mu} = (\mu_{x_1}, \mu_{x_2}, \ldots, \mu_{x_m}, \mu_{y_1}, \mu_{y_2}, \ldots)$
designates the number of latent faults in each category. (A fault
is called latent if it has not yet been isolated.)
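
The bookkeeping implied by this notation can be illustrated with a short Python sketch. It is only an illustration: the stage names, category indices and counts below are hypothetical, and the dictionary layout is an assumption, not the CARE3 data structure.

```python
# Hypothetical fault counts for a subsystem with two stages, p and b.
# Keys are (stage, category index); values are the counts l_{x_i}.
faults_by_category = {("p", 1): 2, ("p", 2): 1, ("b", 1): 1}
latent_by_category = {("p", 1): 1, ("p", 2): 0, ("b", 1): 1}


def per_stage(counts):
    """Sum the category counts of each stage, e.g. l_x = sum_i l_{x_i}."""
    totals = {}
    for (stage, _category), n in counts.items():
        totals[stage] = totals.get(stage, 0) + n
    return totals


l_vec = per_stage(faults_by_category)   # vector (..., l_x, ...)
l_total = sum(l_vec.values())           # scalar l = sum over x of l_x
mu_vec = per_stage(latent_by_category)  # latent faults per stage
```

Here `l_vec` plays the role of the per-stage vector, `l_total` the total number of faulty units, and `mu_vec` the per-stage latent fault counts.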

In addition to the preceding categorization, faults are
also classified in accordance with their effect on the subsystem
of concern at the time of their occurrence. Specifically,
faults are divided into three classes:

(1) Subcritical faults. A fault is said to be subcritical if it,
by itself, cannot cause a subsystem failure in the absence of
subsequent faults (e.g., the first processor fault in SIFT or FTMP).

(2) Critical faults. A fault is called critical if it, in
combination with a pre-existing latent fault, may eventually cause
the system to fail even in the absence of subsequent faults (e.g.,
certain processor faults in SIFT or FTMP while a previous fault is
still undetected).

(3) Supercritical faults. A fault is designated supercritical if
its occurrence causes the subsystem to fail immediately, possibly
but not necessarily as a result of pre-existing faults (e.g.,
faults causing single-point failures).

If a category $y_j$ fault is critical in the presence of a
pre-existing latent category $x_i$ fault, the subsystem is said
to be in an $x_i y_j$-critical state. Such a state is possible,
for example, when faults (or their effects) are intermittent
in nature. Faults of this sort will be said to be either active
(i.e., capable of generating errors) or benign (not active). A
subsystem in an $x_i y_j$-critical state will fail, in the absence
of other faults, if and only if both faults are simultaneously
active. (This statement effectively defines the terms "active"
and "benign.") It will be assumed that any other fault occurring
while the subsystem is in a critical state will also cause it to
fail. (The significance of this assumption is discussed later.)
3.3.2 SUBSYSTEM RELIABILITY MODEL


Table 3.4 defines the inputs needed for the restructured
Form 1 and Form 2 reliability models. The various inputs are
divided into three categories: 1) those provided by the user
in defining the subsystem configuration; 2) those defined by
the user in selecting fault models; and 3) those determined
by the coverage model. Table 3.5 defines, both mathematically
and in words, the functions of these inputs that are evaluated
by CARE3 (cf. section 3) and used to define the integrand in the
RM4 version of the Kolmogorov recursion.

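The RM4 recursion can be expressed in terms of these
functions as follows (cf. equation 10):

$$Q_{\underline{\ell}}(t) = \int_0^t e^{-\Lambda_{\underline{\ell}}(t,\tau)}\, K_{\underline{\ell}}(\tau)\, d\tau \qquad (11)$$

with $\Lambda_{\underline{\ell}}(t,\tau) = \int_\tau^t \lambda_{\underline{\ell}}(\eta)\, d\eta$. The Form 1 version of
$K_{\underline{\ell}}(\tau)$ can be expressed as

$$K_{\underline{\ell}}(\tau) = \sum_{y_j} \left[ Q_{\underline{\ell}-\underline{e}_y}(\tau) + P^*_{\underline{\ell}-\underline{e}_y}(\tau)\, c_{y_j}(\tau) \right] (n_y - \ell_y + 1)\, \lambda_{y_j}(\tau) \qquad (12)$$

with $\underline{\ell} - \underline{e}_y = (\ldots, \ell_x, \ell_y - 1, \ell_z, \ldots)$ and with

$$c_{y_j}(\tau) = D_{y_j}(\underline{\ell}-\underline{e}_y, \tau) + \sum_{x_i} B_{x_i y_j}(\underline{\ell}-\underline{e}_y, \tau)\, g_1(\tau, x_i, y_j) \qquad (13)$$

Equation (11) is identical to equation (10) but with a slight
change in notation to emphasize the relationship between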
Table 3.4

CARE3 INPUTS

Source: User, configuration description

  Function:   $b_{x_i y_j}(\underline{\mu}, \underline{\ell})$
  Definition: Probability that a category $y_j$ fault would place
              the system in an $x_i y_j$-critical state given that the
              total number of faults and the number of latent faults
              of each category, just prior to the occurrence of the
              category $y_j$ fault, are defined by $\underline{\ell}$ and
              $\underline{\mu}$, respectively.

  Function:   $d_{y_j}(\underline{\mu}, \underline{\ell})$
  Definition: Probability that a category $y_j$ fault would be
              supercritical given $\underline{\ell}$ and $\underline{\mu}$.

  Function:   $n_x$
  Definition: Number of initially functioning stage-x modules.

  Function:   $m_x$
  Definition: Minimum number of functioning stage-x modules needed
              for the system or subsystem to function.

Source: User, fault model selection

  Function:   $q_{x_i}(t)\, dt$
  Definition: Probability that a category $x_i$ fault occurs in a
              given stage x module in the interval (t, t+dt).

Source: Coverage model outputs

  Function:   $p_1(t|\tau, x_i)$
  Definition: Probability that a category $x_i$ fault is active but
              undetected at time t given that it occurred at
              time $\tau$.

  Function:   $p_2(t|\tau, x_i)$
  Definition: Probability that a category $x_i$ fault is benign but
              undetected at time t given that it occurred at
              time $\tau$.

  Function:   $p(t|\tau, x_i, y_j)$
  Definition: Probability that any $x_i y_j$-critical state, entered
              at time $\tau$, persists until time t (i.e., neither fault
              has been detected nor has a subsystem failure occurred).

  Function:   $q(t|\tau, x_i, y_j)\, dt$
  Definition: Probability that a system failure occurs in the
              interval (t, t+dt) as the result of an $x_i y_j$-critical
              state entered at time $\tau$.
state transitions and the fault category. (Note that the
summation here is over all fault categories.) Equation (13)
expresses the coverage failure probability in terms of the
functions defined in Table 3.5. That is, the probability of
a coverage failure is just the probability that the fault in
question forces the subsystem into a supercritical state
plus the probability that the fault forces it into an
$x_i y_j$-critical state which eventually causes it to fail.
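
As a rough numerical illustration of the integral in equation (11), the following Python sketch evaluates it by the trapezoidal rule. It assumes a constant total fault rate, so that the exponent reduces to lam * (t - tau), and it uses a hypothetical constant integrand K; neither simplification is part of the CARE3 model, and the function name `q_of_t` is an invention for this example.

```python
import math


def q_of_t(t, k_func, lam, steps=2000):
    """Trapezoidal estimate of int_0^t exp(-lam*(t - tau)) K(tau) dtau.

    Assumes a constant total fault rate lam, so that
    Lambda(t, tau) = lam * (t - tau).
    """
    if t == 0:
        return 0.0
    h = t / steps
    total = 0.0
    for n in range(steps + 1):
        tau = n * h
        weight = 0.5 if n in (0, steps) else 1.0
        total += weight * math.exp(-lam * (t - tau)) * k_func(tau)
    return total * h


# Hypothetical constant integrand K(tau) = k, for which the integral
# has the closed form k * (1 - exp(-lam * t)) / lam.
lam, k, t = 1.0e-3, 2.0e-4, 10.0
estimate = q_of_t(t, lambda tau: k, lam)
exact = k * (1.0 - math.exp(-lam * t)) / lam
```

In the actual recursion the integrand depends on previously computed state probabilities, so the integral would be evaluated state by state in the order discussed in paragraph 3.4.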

The Form 2 expression for $K_{\underline{\ell}}(\tau)$ is

V T > 

+ A'ix ly p* ( t) (14) 

Here $c_{y_j}(\tau)$ is as defined in equation (13) but with
$g_1(\tau, x_i, y_j)$ replaced by $g_2(\tau, x_i)$. This reflects the
fact that in the Form 2 recursion, a subsystem failure is not
counted until it actually occurs. Thus, a fault forcing the
subsystem into a critical state does not actually cause the
system to fail at that time unless the pre-existing fault is
active. The term $A'(\tau|\underline{\ell})$ accounts for subsystem failures
occurring at time $\tau$ as a consequence of previously entered
critical states that did not immediately cause a failure.
The term $A(\tau|\underline{\ell}-\underline{e}_y)$ reflects the fact that any third
fault occurring while the subsystem is in a critical state is
assumed to cause it to fail.
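Table 3.5

CARE3 FUNCTIONS

Function:   $r_{x_i}(t)$
Expression: $1 - \int_0^t q_{x_i}(\tau)\, d\tau$
Definition: Probability that a given stage x module has not
            experienced a category $x_i$ fault by time t.

Function:   $r_x(t)$
Expression: $\prod_i r_{x_i}(t)$
Definition: Reliability of a stage x module.

Function:   $\lambda_{x_i}(t)$
Expression: $q_{x_i}(t) / r_{x_i}(t)$
Definition: Rate of occurrence of category $x_i$ faults in a given
            operational stage x module.

Function:   $\lambda_{\underline{\ell}}(t)$
Expression: $\sum_x (n_x - \ell_x) \sum_i \lambda_{x_i}(t)$
Definition: Rate of occurrence of faults in the $\sum_x (n_x - \ell_x)$
            modules that are fault-free at time t.

Function:   $a_{x_i}(t)$
Expression: $\dfrac{\int_0^t p_s(t|\tau, x_i)\, r_x(\tau)\, \lambda_{x_i}(\tau)\, d\tau}{1 - r_x(t)}$,
            with $p_s(t|\tau, x_i) = p_1(t|\tau, x_i) + p_2(t|\tau, x_i)$
Definition: Probability that a given stage x module has a category
            $x_i$ latent fault at time t given that it has experienced
            some fault by time t.

Function:   $a_x(t)$
Expression: $\sum_i a_{x_i}(t)$
Definition: Probability that a given stage x module has a latent
            fault at time t given that it has experienced some fault
            by time t.

Function:   $P(\underline{\mu}_x | \ell_x, t)$
Expression: $\dfrac{\ell_x!}{(\ell_x - \mu_x)!\, \prod_i \mu_{x_i}!}\, (1 - a_x(t))^{\ell_x - \mu_x} \prod_i a_{x_i}(t)^{\mu_{x_i}}$
Definition: Probability that a subsystem contains $\underline{\mu}_x$ stage x
            latent faults given that it has $\ell_x$ faulty stage x
            modules.

Function:   $P(\underline{\mu} | \underline{\ell}, t)$
Expression: $\prod_x P(\underline{\mu}_x | \ell_x, t)$
Definition: Probability that a system having $\underline{\ell}$ faulty modules
            has $\underline{\mu}$ latent faults.

Function:   $D_{y_j}(\underline{\ell}, t)$
Expression: $\sum_{\underline{\mu}} d_{y_j}(\underline{\mu}, \underline{\ell})\, P(\underline{\mu} | \underline{\ell}, t)$
Definition: Probability that a system containing $\underline{\ell}$ faults would
            be in a supercritical state were a category $y_j$ fault to
            occur at time t.

Function:   $B_{x_i y_j}(\underline{\ell}, t)$
Expression: $\sum_{\underline{\mu}} b_{x_i y_j}(\underline{\mu}, \underline{\ell})\, P(\underline{\mu} | \underline{\ell}, t)$
Definition: Probability that a system containing $\underline{\ell}$ faults would
            enter an $x_i y_j$-critical state were a category $y_j$ fault
            to occur at time t.

Table 3.5 (Cont.)

Function:   $g_2(t, x_i)$
Expression: $\dfrac{\int_0^t p_1(t|\tau, x_i)\, r_x(\tau)\, \lambda_{x_i}(\tau)\, d\tau}{a_{x_i}(t)\, (1 - r_x(t))}$
Definition: Probability that a category $x_i$ fault is active at time
            t given that it is latent at time t.

Function:   $g_1(\tau, x_i, y_j)$
Expression: $1 - [1 - g_2(\tau, x_i)]\, [1 - \int_\tau^\infty q(t|\tau, x_i, y_j)\, dt]$
Definition: Probability, given that a system enters an
            $x_i y_j$-critical state at time $\tau$, that this event
            eventually causes a system failure.

Function:   $A(t|\underline{\ell})$
Expression: $\sum_{x_i, y_j} (n_y - \ell_y + 1) \int_0^t B_{x_i y_j}(\underline{\ell}-\underline{e}_y, \tau)\, q_{y_j}(\tau)\, (1 - g_2(\tau, x_i))\, p(t|\tau, x_i, y_j)\, d\tau$
Definition: Probability that a system having $\underline{\ell}$ faults is in a
            critical state at time t, with
            $\underline{\ell} - \underline{e}_y = (\ldots, \ell_x, \ell_y - 1, \ell_z, \ldots)$.

Function:   $A'(t|\underline{\ell})$
Expression: $\sum_{x_i, y_j} (n_y - \ell_y + 1) \int_0^t B_{x_i y_j}(\underline{\ell}-\underline{e}_y, \tau)\, q_{y_j}(\tau)\, (1 - g_2(\tau, x_i))\, q(t|\tau, x_i, y_j)\, d\tau$
Definition: Rate at which systems having $\underline{\ell}$ faults fail at time t
            due to critical fault conditions.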
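
The first and third entries of Table 3.5 (the probability that a module has not yet experienced a fault, and the resulting fault rate for an operational module) can be illustrated with a short Python sketch. The exponential fault density and its rate are hypothetical choices made only to give a checkable case; the function names are not CARE3 names.

```python
import math


def survival(q, t, steps=4000):
    """r(t) = 1 - int_0^t q(tau) dtau, by the trapezoidal rule."""
    h = t / steps
    area = 0.5 * (q(0.0) + q(t)) + sum(q(n * h) for n in range(1, steps))
    return 1.0 - area * h


def hazard(q, t):
    """lambda(t) = q(t) / r(t): fault rate of a still-operational module."""
    return q(t) / survival(q, t)


# Hypothetical exponential fault density with rate 1e-4 per hour.
rate = 1.0e-4


def density(t):
    return rate * math.exp(-rate * t)


lam_at_10h = hazard(density, 10.0)
```

For an exponential density the ratio q(t)/r(t) is constant and equal to the rate, which is the situation assumed in the FTMP and SIFT test cases where the fault rates do not vary with time.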




There are several assumptions implicit in these expressions 
which should be noted: 

1. It is assumed that a faulty module can be characterized
by the first fault it experiences, although the possibility
of subsequent faults is not excluded. (See, for example, the
expression for $a_x(t)$ in Table 3.5.) If a second fault does
occur, it could have one of three effects: a) it could
shorten the latency period; b) it could cause the subsystem
to fail only if the first fault is still latent; c) it
could cause the subsystem to fail even if the first fault
has been detected.

The first of these effects can be accounted for in the
coverage model, the second by adding a term to the recursion
integrand $K_{\underline{\ell}}(\tau)$ to account for that possibility, and the
third can be modeled as a "category 3" failure. It is
proposed, however, to ignore the first effect and to combine
the second and third effects in estimating the probability
of a category 3 failure. The rationale for this is as
follows: The likelihood of a second failure during the
latency period of a previous failure in the same module is,
in most instances, entirely negligible. In any event, the
approach just described overbounds the subsystem failure
probability. (Ignoring the reduced latency caused by a
second fault is clearly pessimistic. Treating the second
effect in a separate category results in some "double counting";
i.e., a single fault is allowed to cause the subsystem
to fail twice, once as a result of a second failure in the
same module and again as a consequence of a failure in some
other module.) The increase in the failure probability
estimate as a result of such approximations is clearly
insignificant for all cases of practical interest. Thus,
while more exact expressions could be relatively easily
incorporated into CARE3 and COVRGE to account for such
events, their minor importance does not appear to justify
the added complexity.

2. Critical states are defined only for pairs of latent
faults. It is possible, for example, to define an
$x_i y_j z_k$-critical state in which a failure occurs only if all
three faults are simultaneously active. None of the fault-tolerant
systems examined thus far, however, has exhibited such
failure mechanisms. Thus, while the reliability model structure
described in the preceding paragraphs could readily
accommodate a more general critical-state definition, the
resulting added complexity does not seem to be justified.

3. Any new fault occurring while the subsystem is in a
critical state causes it to fail. In many cases this is in
fact not true; an arbitrary fault does not necessarily cause
the subsystem to fail even when it is in a critical state.
The purpose of making this assumption was, of course, to
eliminate the need to account for even more complicated
fault patterns involving, for example, simultaneous
$x_i y_j$- and $x_i z_k$-critical states. Once again, the probability
of such events is small, and the complexity needed for more
precise estimation does not seem to be justified.

(It should be noted that the restriction under discussion
here is considerably less severe than the restriction that
three simultaneous latent faults cause a failure, as is
evident from the results in paragraph 3.2.)
3.3.3 SPECIALIZATION FOR FTMP AND SIFT

The input parameters used for the FTMP and SIFT test
cases discussed in paragraph 3.2 are listed in Table 3.6. The
FTMP model used for intermittent faults recognized only three
rather than five fault categories; in this case the input
parameters are as defined in Table 3.6 but with
$\ell_{p_2} = \ell_{m_2} = \mu_{p_2} = \mu_{m_2} = 0$.

The definition of these parameters is relatively straightforward.
The functions $M_0$, $N_0$ and $N_1$ are just the probabilities
that no two modules in any FTMP processor or memory triad
both contain latent faults, that no active bus contains a
latent fault, and that exactly one active bus contains a
latent fault, respectively. Thus, $b_{p_1 p_2}(\underline{\mu}, \underline{\ell})$, for example, is
the probability that no two processors or memories in any
triad contain latent faults, that no active bus contains a
latent fault, and that, should a category $p_2$ fault occur, it
would affect a processor in a triad already suffering from a
latent category $p_1$ fault. Similarly, the parameter $b_{p_i b}(\underline{\mu}, \underline{\ell})$
is the probability that no two processors or memories in any
triad contain a latent fault, that no active bus contains a
latent fault, that all memories having latent faults and all
but one processor having a latent fault use the same bus, and
that that bus is the one to be affected should a bus fault
occur.

One class of fault situations in the FTMP requires
special consideration. Suppose one of the active buses contains
a latent fault, that all processors and memories
containing latent faults use that bus and that at least one
processor does contain a latent fault. Then a new processor
fault affecting the triad already containing one latent fault
Table 3.6a


FTMP INPUT PARAMETERS * 


b ~ ) 

P^ P 2 m 2 b 


1 = (£ n' l m ' V = ^ + 1 A = A + A 

P m b P P x P 2 m m 1 m 2 


V' V' V V V **p - “p + " 


y 

1 P 2 m 


y m + y m 

1 m 2 


Let 


W V " 


f 1 y x = 0 , 1 

(l W y x~ 3) (n x"V y x~ 6) * * * (y„-D ) 


X X X 


X 


la x ~ l x +v x- V (n x'V ll x' 2) • • • (n x-V u x' ( 'V 1) > 


N 0 (p b’ V 


|vV W 1 ’ 'y^- 2 ' 

(n b'V p b ) < VW 1 * ( VVV 2) 


w V ■ 


3tn b-*b> (n b'V 1,p b 


“W'b 1 ‘’W'V 11 (n b-V p b- 2) 


m q ( y / A) 


M_(y , A )M A (y , A ) 
Op p 0 m m 


Then (for x ± = p x , p 2 , rt^, m 2 ; x = p, m) : 
2y x 

b x.x> i> - rtrV*' i>VV V 

1 D xx 


* See "List of Symbols" for verbal definitions. 



2y 


vt(l) 


x_. ... y + y -1 

p m 


M 0 -* N 0 ^ y b f A b^ 


~ n -i -2 y , y +y 

b (u jn = — — x - x / 1 \ P m 

bx — 3 n -A 1 3 

1 XX 


M 0 (y, 5 1 )N 1 (y b , £ b ) 


b bb ( ^ £> = < 


n b -Z b N 1 ( V V y p “ lJ m " 0 


otherwise 


, b p . m ,<E. i> = b m . p .(E- i> = » 

i D i ] 


2y 


« X .(E. A> = H-rr <t) ^ U "' M o ( !i' i>W V 

1 XX 


f o 


d h (y, = J 


n b- i b 


u+u =o 

p m 


/i \ y^ + V™ i y +y 

1 -(i) *> - 2« Mp + p m , ,i, P “ 


W V 


+ — V- (t) P P' n,r " N, <y. , 1. ) 
n -JL 3 1 b b 

b b 


M 0 (p, A) 


u + > ° 

p m 


X x. (t) = X x. ; X b (t) = X b 
1 1 
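Table 3.6b

SIFT INPUT PARAMETERS *

Case 1a:  $\underline{L} = \underline{\ell} = (\ell_p, \ell_b)$,  $\underline{\mu} = (\mu_p, \mu_b)$

  $b_{pp}(\underline{\mu}, \underline{\ell})$ = 1 if $\underline{\mu} = (1, 0)$; 0 otherwise
  $b_{pb}(\underline{\mu}, \underline{\ell}) = b_{bp}(\underline{\mu}, \underline{\ell}) = 0$
  $b_{bb}(\underline{\mu}, \underline{\ell})$ = 1 if $\underline{\mu} = (0, 1)$; 0 otherwise
  $d_{xy}(\underline{\mu}, \underline{\ell}) = 0$, all x, y
  $\lambda_b(t) = \lambda_b$

Cases 1b, 2 (two independent subsystems):
  $\underline{L} = \underline{\ell} = \ell_p$ and $\underline{L} = \underline{\ell} = \ell_b$

  $b_{pp}(\mu, \ell)$ = 1/s if $\mu = 1$; 0 if $\mu \ne 1$
  $b_{bb}(\mu, \ell)$ = 1/s if $\mu = 1$; 0 if $\mu \ne 1$
  $d_{pp}(\mu, \ell) = 0$
  $d_{bb}(\mu, \ell) = 0$
  $\lambda_p(t) = \lambda_{0p}\, s$
  $\lambda_b(t) = \lambda_{0b}\, s$
  $s = 1 + r(1 - p_{tr})$   (r = 0 for case 1b)

Case 1c:  Same as case 1a except:

  $b_{pb}(\underline{\mu}, \underline{\ell})$ = 1 if $\underline{\mu} = (1, 0)$; 0 otherwise
  $b_{bp}(\underline{\mu}, \underline{\ell})$ = 1 if $\underline{\mu} = (0, 1)$; 0 otherwise

* See "List of Symbols" for verbal definitions.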
creates two critical situations: a $b p_i$-critical fault and a
$p_i p_j$-critical fault. Although such an event does not
necessarily cause the system to fail, it was elected to treat
all such events as fatal and hence to reflect their probabilities
in the $d(\underline{\mu}, \underline{\ell})$ parameters. Since these are clearly
events of relatively low probability, the added complexity
needed to account for the possibility that the system could
recover from them was not felt to be justified. Treating all
such events as system failures, of course, again overbounds
the true failure probability. The parameters $d_{p_i}(\underline{\mu}, \underline{\ell})$ and
$d_{m_i}(\underline{\mu}, \underline{\ell})$ thus account for the event just described. The
parameter $d_b(\underline{\mu}, \underline{\ell})$ is the probability either that at least
two buses are used by memories or processors containing
latent faults or that one bus and at least one memory or
processor contains a latent fault and that, when a new bus
fault occurs, it affects an active bus. Again both events
produce a pair of critical fault situations.

The SIFT parameters shown in Table 3.6 are largely
self-explanatory. The first three cases (cases 1a, 1b and 1c)
differ only in the nature of the coupling between the two
stages (cf. paragraph 3.2). The fourth case allows transients
to occur at a rate r times the permanent fault rate. Since
the probability of a "leaky" transient is $1 - p_{tr}$ and since
leaky transients do not produce coverage failures, the
probability that an arbitrary fault produces a critical fault
situation is reduced by the probability $1/[1 + r(1 - p_{tr})]$ that
the fault is a leaky transient.

In addition to the parameters specified in Table 3.6, the
CARE3 model must have access to the functions, defined in
Table 3.5, used to characterize coverage. Since COVRGE has
not yet been implemented, these functions were generated by
hand. These functions are easily defined in the permanent
fault case:

$$p_1(t|\tau, x_i) = e^{-\delta_{x_i}(t-\tau)} \qquad \text{(FTMP)}$$

$$p_1(t|\tau, x_i) = \begin{cases} 1 & 0 < t - \tau \le \tau_0 \\ 0 & \text{otherwise} \end{cases} \qquad \text{(SIFT)}$$

$$p_2(t|\tau, x_i) = 0$$

$$p(t|\tau, x_i, y_j) = 0$$

$$q(t|\tau, x_i, y_j) = \delta(t - \tau)$$

with $\delta_{x_i}$ the FTMP fault detection rate, $\tau_0$ the SIFT detection
delay, and $\delta(t)$ the Dirac delta function.

In the FTMP intermittent case, the first two of these
functions are defined by a three-state Markov model and the
last two by a five-state Markov model, as shown in Figure 3.5.
If $p_{ij}(t|\tau)$ represents the probability of being in state i at
time t given that the system described by the three-state
Markov model was in state j at time $\tau$, and if $P_{ij}(t|\tau)$ is
similarly defined for the five-state model, then

$$p_1(t|\tau, x) = p_{11}(t|\tau)$$

$$p_2(t|\tau, x) = p_{21}(t|\tau)$$

$$p(t|\tau, x, y) = P_{11}(t|\tau) + P_{21}(t|\tau) + P_{31}(t|\tau)$$

$$q(t|\tau, x, y) = \beta\, [P_{11}(t|\tau) + P_{31}(t|\tau)]$$
Figure 3.5 INTERMITTENT FAULT MODEL

a) Single-fault Markov model (states: active, benign, fault detected)

b) Double-fault Markov model (states: x active / y benign, both
   benign, x benign / y active, detected, system failure)






The functions $p_{ij}(t|\tau)$ and $P_{ij}(t|\tau)$ are readily determined
either by hand (the first function involves solving a
quadratic equation, the second a cubic) or by using one of
the techniques described in Appendix 1.
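
The hand solution of the three-state model can be sketched in Python as follows. The transition structure is an assumption made for this example, since the figure is not reproduced here: an active fault becomes benign at rate alpha, a benign fault becomes active at rate beta, and only an active fault is detected, at rate delta. Under those assumptions the two undetected-state probabilities follow from the roots of a quadratic.

```python
import math


def three_state(t, alpha, beta, delta):
    """Return (p11, p21): probabilities of active and benign at elapsed time t.

    Assumed rates (not stated in this section): active -> benign at alpha,
    benign -> active at beta, active -> detected at delta. The fault
    starts in the active state at elapsed time 0.
    """
    # Roots of s^2 + (alpha + beta + delta) s + beta*delta = 0.
    b = alpha + beta + delta
    disc = math.sqrt(b * b - 4.0 * beta * delta)
    s1, s2 = (-b + disc) / 2.0, (-b - disc) / 2.0
    e1, e2 = math.exp(s1 * t), math.exp(s2 * t)
    p11 = ((s1 + beta) * e1 - (s2 + beta) * e2) / (s1 - s2)
    p21 = alpha * (e1 - e2) / (s1 - s2)
    return p11, p21
```

Because the rates are constant, the probabilities depend only on the elapsed time t - tau, which is the argument passed to `three_state`. The five-state model leads to a cubic in the same way and is not sketched here.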
3.4 PROGRAMMING APPROACHES FOR SYSTEM UNRELIABILITY MODEL


3.4.1 INTRODUCTION 

The following paragraphs describe the techniques used 
to program the reliability model RM4 postulated in paragraph 
3.1.4. For illustrative purposes, the parameters and 
dimensions discussed are those used for the FTMP model. As 
will become apparent, however, these parameters and dimensions 
can be readily modified as required to accommodate other
situations.

3.4.2 COMPUTATION OF $Q_{\underline{\ell}}(t)$ RECURSIVELY

In order to compute the probabilities $Q_{\underline{\ell}}(t)$ recursively,
where $\underline{\ell} \to (i, j, k)$†, an array must be defined for the
$Q_{\underline{\ell}}(t)$ probabilities so that $Q_{i-1,j,k}(t)$, $Q_{i,j-1,k}(t)$ and
$Q_{i,j,k-1}(t)$ can be accessed when computing $Q_{i,j,k}(t)$.

If

NP = no. of processors = 15; NPS = no. of processor survivors = 2
NM = no. of memories = 9; NMS = no. of memory survivors = 2
NB = no. of buses = 5; NBS = no. of bus survivors = 2
ITMAX = maximum no. of time steps = 50
QLT = array representing $Q_{\underline{\ell}}(t)$,

then the array QLT must be dimensioned
(NP - NPS + 1, NM - NMS + 1, NB - NBS + 1, ITMAX + 1) = (14, 8, 4,
51), which includes 0 processor failures, 0 memory failures,
0 bus failures and time 0.

†For purposes of this example, $\underline{\ell}$ is a three-dimensional vector,
$\underline{\ell} = (i, j, k)$, with i denoting the number of failed
processors, j the number of failed memory units and k the
number of failed buses.


The immediate requirement then becomes the definition of
a loop structure within the program for computing $Q_{\underline{\ell}}(t)$ so that
all required probabilities have been previously computed and
stored in the array. For example, when computing $Q_{\underline{\ell}}(t)$ for
$\underline{\ell} = (3, 2, 1)$, $Q_{2,2,1}(t)$, $Q_{3,1,1}(t)$ and $Q_{3,2,0}(t)$ must
have been previously computed and stored in the QLT array.

Let II, JJ, KK, IT be the indices into the QLT array
representing $Q_{i,j,k}(t)$. The basic structure in FORTRAN is
then as follows.
C     Basic Fortran Algorithm
C
C     Loop limits match the QLT dimensions (14, 8, 4)
      NPP1 = NP - NPS + 1
      NMP1 = NM - NMS + 1
      NBP1 = NB - NBS + 1
      DO 100 KK = 1, NBP1
      DO 100 JJ = 1, NMP1
      DO 100 II = 1, NPP1
C
      IIM1 = II - 1
      JJM1 = JJ - 1
      KKM1 = KK - 1
      I = IIM1
      J = JJM1
      K = KKM1
      DO 75 IT = 1, ITMAX
C     Compute Q (II, JJ, KK, IT) using
C     Q (IIM1, JJ, KK, IT), Q (II, JJM1, KK, IT),
C     Q (II, JJ, KKM1, IT) where computing subroutines
C     use I, J and K
   75 CONTINUE
C
  100 CONTINUE
C
This structure would compute the state probabilities in
the following sequence of QLT indices (II, JJ, KK):

    (1, 1, 1), (2, 1, 1), (3, 1, 1), ..., (NPP1, 1, 1),
    (1, 2, 1), (2, 2, 1), ..., (NPP1, NMP1, 1),
    (1, 1, 2), (2, 1, 2), (3, 1, 2), ..., (NPP1, NMP1, 2),
    ...
    (NPP1, NMP1, NBP1)

For each entry the required states are (II-1, JJ, KK),
(II, JJ-1, KK) and (II, JJ, KK-1).*

* A state vector with an index of 0 is defined as having 0
probability because a 0 index represents a negative component
in the state vector (i, j, k), and hence designates a
nonexistent state.
Clearly all state probabilities will have been previously
defined and stored in the QLT array so that they are available
when required.
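
This ordering property can be checked mechanically. The Python sketch below walks the index triples in the same nesting as the basic Fortran loop (KK outermost, II innermost) and records any required state that has not yet been computed. The extents are those of the QLT dimensions in the FTMP example; the variable names are chosen for this illustration only.

```python
from itertools import product

# Array extents from the FTMP example: (NP-NPS+1, NM-NMS+1, NB-NBS+1).
NPP1, NMP1, NBP1 = 14, 8, 4

computed = set()
violations = []
# Same nesting as the basic Fortran loop: KK outermost, II innermost.
for kk, jj, ii in product(range(1, NBP1 + 1),
                          range(1, NMP1 + 1),
                          range(1, NPP1 + 1)):
    needed = [(ii - 1, jj, kk), (ii, jj - 1, kk), (ii, jj, kk - 1)]
    for state in needed:
        # An index of 0 denotes a nonexistent state with probability 0.
        if 0 not in state and state not in computed:
            violations.append(((ii, jj, kk), state))
    computed.add((ii, jj, kk))
```

An empty `violations` list confirms that every required state is available when it is needed.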

Several problems occur if QLT is dimensioned and computed
in this manner:

1. CDC Fortran Extended allows an array a maximum of 3
   dimensions. Therefore the statement

       DIMENSION QLT (14, 8, 4, 51)

   is an illegal declaration and will not compile.

2. The amount of memory required for such an array
   would be enormous:

       14 x 8 x 4 x 51 words, i.e., 22,848 words

3. Extending the model to include, for example, I/O
   modules would cause a problem because this would
   require an added dimension to the array (if
   such a dimension were legal). This would also
   increase the size of the QLT array even further.

4. Unnecessary computation of state probabilities
   would result, namely of those which are so small
   that they have no effect upon the resultant
   probability. For example, the probability
   associated with state (13, 6, 3), i.e., 13
   failed processors, 6 failed memory modules and 3
   failed buses by time t, may be too small to affect
   the system probability as a whole.

The solution to problem 1 is to create a mapping of
$(i, j, k) \to n$ which will reduce the QLT array to 2 dimensions:
QLT (NMAX, IT). This will also solve problem 3; extending the
model from $(i, j, k) \to n$ to $(i, j, k, m) \to n$ would be a
relatively minor programming enhancement. The only part of the
program to change would be the mapping routine, plus model
changes due to the addition of vector component m. This
dimension solution, however, has no effect upon the size of
the QLT array. The dimension statement now becomes
DIMENSION QLT (448, 51) and would require the same amount of
storage as previously.
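
The text does not specify the mapping itself. One simple possibility, shown here only as an assumption, is the usual column-major rule with the first index varying fastest; the function names `to_row` and `from_row` are hypothetical.

```python
# Extents of the three state indices in the FTMP example.
NI, NJ, NK = 14, 8, 4
NMAX = NI * NJ * NK  # 448 rows in the two-dimensional QLT


def to_row(ii, jj, kk):
    """Map 1-based (ii, jj, kk) to a 1-based row n, with ii varying fastest."""
    return ii + NI * ((jj - 1) + NJ * (kk - 1))


def from_row(n):
    """Inverse of to_row."""
    rest, i0 = divmod(n - 1, NI)
    k0, j0 = divmod(rest, NJ)
    return i0 + 1, j0 + 1, k0 + 1
```

Adding a fourth component m would only require one more extent and one more term in each routine, which is the sense in which the extension is a minor change. The row count, 448, and the total of 448 x 51 = 22,848 words are unchanged from the four-dimensional declaration.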

The solution to problems 2 and 4 would be to modify the 
basic loop structure defined above so that: 

a. The state probabilities are computed in a 
flow from largest to smallest; this 
enables the program to halt execution at a 
point where the probabilities no longer affect 
the result; 

b. Only those probabilities actually needed to 
calculate the current state probability have 
to be stored in array QLT at any one time, 
thus reducing its size. 


The following chart lists the computational flow required 
versus the basic computational flow. Each set consists of all 
permutations of vectors where the largest component of any 
vector is the set number. Vectors with components all less 
than the current set number were defined in previous sets; the 
probabilities associated with these vectors are not recomputed. 
CHART 1

COMPUTATIONAL FLOW OF STATE VECTORS

Columns: basic computational flow (II, JJ, KK); modified
computational flow with sets (II, JJ, KK), Set 1 through Set 14;
example required states (II-1, JJ, KK), (II, JJ-1, KK),
(II, JJ, KK-1). For example, state (4, 4, 4) in Set 4 requires
states (3, 4, 4), (4, 3, 4) and (4, 4, 3).

*These state probabilities have been previously computed and will not be recomputed.
They are only dummy place holders used to show the algorithm more clearly.

**States with 0 indices do not exist.
The chart shows that only two sets need be in memory at 
any one time: the set being computed and its predecessor set. 
This occurs because the required states have either been 
computed in the predecessor set or previously in the set being 
computed. Also, with this method, only the state probabilities 
not computed in prior sets are stored in array QLT. The 
number of unique states in each set for the case where 

NP = 15, NPS = 2 
NM = 9, NMS = 2 
NB = 5, NBS = 2 

is shown in the following chart: 

    Set     No. of Unique States 

      1              1 
      2              7 
      3             19 
      4             37 
      5             36 
      6             44 
      7             52     largest two 
      8             60     consecutive sets 
      9             32 
     10             32 
     11             32 
     12             32 
     13             32 
     14             32 

CHART 2 
Sets 7 and 8 are the largest two consecutive sets, having 
52 and 60 states, respectively. Therefore, the QLT array need 
only be dimensioned (112, 51), which is a total of 5712 words. 
Using this method, the amount of storage required for the QLT 
array was decreased by 17,136 words. 
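
The set sizes and the storage figures quoted above can be checked with a short sketch. It is illustrative only: the module and function names are hypothetical, and it assumes that each index II, JJ, KK starts at 1 and is capped at NPF = 14, NMF = 8 and NBF = 4 respectively (the values implied by NP = 15, NPS = 2, NM = 9, NMS = 2, NB = 5, NBS = 2).

```fortran
! Sketch: count the new state vectors in each set and size the QLT array.
module set_sizes
  implicit none
contains
  ! Number of vectors (II, JJ, KK) whose largest index equals iset.
  pure integer function set_size(iset, npf, nmf, nbf) result(cnt)
    integer, intent(in) :: iset, npf, nmf, nbf
    cnt = box(iset) - box(iset - 1)
  contains
    ! All vectors with every index <= s, after capping each index.
    pure integer function box(s) result(v)
      integer, intent(in) :: s
      v = min(s, npf) * min(s, nmf) * min(s, nbf)
    end function box
  end function set_size

  ! Largest sum of two consecutive set sizes = rows needed in QLT.
  pure integer function max_pair(npf, nmf, nbf) result(rows)
    integer, intent(in) :: npf, nmf, nbf
    integer :: iset
    rows = 0
    do iset = 2, max(npf, nmf, nbf)
      rows = max(rows, set_size(iset - 1, npf, nmf, nbf) &
                       + set_size(iset, npf, nmf, nbf))
    end do
  end function max_pair
end module set_sizes

program demo_set_sizes
  use set_sizes
  implicit none
  integer, parameter :: npf = 14, nmf = 8, nbf = 4, itmax = 51
  integer :: iset, rows
  do iset = 1, max(npf, nmf, nbf)
    print '(a, i3, a, i4)', 'set', iset, '  states', set_size(iset, npf, nmf, nbf)
  end do
  rows = max_pair(npf, nmf, nbf)
  print '(a, i5, a, i7)', 'rows', rows, '  words', rows * itmax
  print '(a, i7)', 'words saved', npf * nmf * nbf * itmax - rows * itmax
end program demo_set_sizes
```

With these inputs the counts agree with Chart 2, the row count is 52 + 60 = 112, and the saving is 448 x 51 - 112 x 51 = 17,136 words.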

The Fortran code required to compute the QLT array in 
sets, with only two sets of probabilities in memory at any 
one time, follows: 

C     FORTRAN ALGORITHM TO COMPUTE SETS OF STATES 
C 
C     Compute QLT(1,IT) for n = 1, i.e. state (0, 0, 0), directly 
C     for all time steps. 
C          . 
C          . 
C     Initialize NSET(ISET) for set 1 to 1; only one state 
C     vector exists in set 1: (0, 0, 0). 
      NSET(1) = 1 
C 
C     Compute maximum number of failures permitted including 0. 
      NPF = NP - NPS + 1 
      NMF = NM - NMS + 1 
      NBF = NB - NBS + 1 
C 
C     Compute maximum indices. 
      NPP1 = NP + 1 
      NMP1 = NM + 1 
      NBP1 = NB + 1 
C 
C     Determine maximum set to compute. 
      MAX = MAX0(NPF, NMF, NBF) 
C 
C     Compute sets of state vector probabilities. 
      DO 200 ISET = 2, MAX 
      ISETB = ISET 
      ISETM = ISET 
      ISETP = ISET 
      IF (ISETB.GT.NBP1) ISETB = NBP1 
      IF (ISETM.GT.NMP1) ISETM = NMP1 
      IF (ISETP.GT.NPP1) ISETP = NPP1 
C 
C     Initialize QLT index N to the number of vectors in the 
C     previous set plus one. 
      NUMPREV = NSET(ISET-1) 
      N = NUMPREV + 1 
      IF (ISET.EQ.2) GO TO 60 
C 
C     Pop vector probabilities off QLT array which were defined 
C     two sets ago by moving the predecessor set up in the array. 
      NPOP = NSET(ISET-2) 
      DO 50 M = 1, NUMPREV 
      MM = NPOP + M 
C     Transfer QLT(MM,IT) for all time steps. 
      DO 50 IT = 1, ITMAX 
      QLT(M,IT) = QLT(MM,IT) 
   50 CONTINUE 
C 
   60 CONTINUE 
C 
C     Initialize unique state vector counter to 0. 
      NSTOT = 0 
C 
C     Begin main three loops which define the state vectors. 
      DO 100 KK = 1, ISETB 
      DO 100 JJ = 1, ISETM 
      DO 100 II = 1, ISETP 
C     Do not compute any previously computed state vector 
C     probabilities. 
      IF (II.LT.ISET .AND. JJ.LT.ISET .AND. KK.LT.ISET) GO TO 100 
      I = II-1 
      J = JJ-1 
      K = KK-1 
C 
C     Compute QLT(N,IT) for all time steps. 
      DO 75 IT = 1, ITMAX 
   75 CONTINUE 
C 
C     Increase QLT index N and unique vector counter NSTOT by 
C     one. 
      N = N + 1 
      NSTOT = NSTOT + 1 
C 
  100 CONTINUE 
C 
C     Store total number of unique vectors for the current set 
C     ISET. 
      NSET(ISET) = NSTOT 
  200 CONTINUE 
C 

This Fortran structure is the basic programming core for 
the various CARE III models programmed thus far. 


3.4.3 PROGRAM DIFFERENCES PER MODEL 


The subroutine which computes the unique mathematical 
calculations for each model is subroutine SUMMAT. This 
subroutine and its associated functions vary for each model. 
They represent the numerator in the integrand of the 
integrated form of the Kolmogorov equation: 


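\[
Q_\ell(t) = e^{-\int_0^t \lambda_\ell(\eta)\,d\eta}
\int_0^t
\frac{\overbrace{\sum_j \left[ Q_j(\tau) + P_j(\tau)\,C_{j\ell}(\tau) \right] \lambda_{j\ell}(\tau)}^{\text{SUMMAT}}}
     {e^{-\int_0^\tau \lambda_\ell(\eta)\,d\eta}}\,d\tau
\]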


The main concern in programming subroutine SUMMAT for 
each model is to eliminate redundant computations. Two types 
of function computations are required: functions which are 
time dependent and functions which are vector dependent, i.e., 
dependent upon (i, j, k). The time dependent functions must 
be removed from subroutine SUMMAT and computed in subroutine 
TDEPEND. TDEPEND computes all time dependent functions once 
and stores them in arrays. These arrays can later be accessed 
from subroutine SUMMAT each time the vector changes. This 
approach keeps execution time at a minimum because it takes 
much less time to retrieve a function value from an array 
than it does to recompute the function each time the vector 
(i, j, k) changes. SUMMAT then computes the vector dependent 
portions of the model while accessing the time dependent arrays. 
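
The division of work between TDEPEND and SUMMAT can be illustrated with a small sketch. It is not the CARE III code: the routine names fill_table and weighted_sum, the exponential factor and the scalar weight are placeholders chosen only to show a time-dependent table being built once and then reused for every state vector.

```fortran
! Sketch: tabulate time-dependent values once, reuse them per state vector.
module tdep_sketch
  implicit none
contains
  ! Stand-in for TDEPEND: evaluate a time-dependent factor once per step.
  pure subroutine fill_table(step, rate, tfun)
    real, intent(in) :: step, rate
    real, intent(out) :: tfun(:)
    integer :: it
    do it = 1, size(tfun)
      tfun(it) = exp(-rate * step * real(it - 1))
    end do
  end subroutine fill_table

  ! Stand-in for SUMMAT: combine a vector-dependent weight with the table.
  pure function weighted_sum(weight, tfun) result(total)
    real, intent(in) :: weight, tfun(:)
    real :: total
    total = weight * sum(tfun)
  end function weighted_sum
end module tdep_sketch

program demo_tdep
  use tdep_sketch
  implicit none
  integer, parameter :: itmax = 51
  real :: tfun(itmax)
  integer :: n
  call fill_table(0.1, 0.5, tfun)            ! EXP is evaluated 51 times in total
  do n = 1, 4                                ! loop over state vectors
    print *, n, weighted_sum(real(n), tfun)  ! table lookups only
  end do
end program demo_tdep
```

In this sketch the number of EXP evaluations stays at ITMAX regardless of how many state vectors are processed, which is the saving the paragraph describes.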
3.4.4 NUMERICAL INTEGRATION TECHNIQUES 

The Trapezoidal rule 

\[
\int_{x_0}^{x_1} f(x)\,dx \approx \frac{\Delta x}{2}\left[ f(x_0) + f(x_1) \right]
\]

and Simpson's 1/3 rule 

\[
\int_{x_0}^{x_2} f(x)\,dx \approx \frac{\Delta x}{3}\left[ f(x_0) + 4f(x_1) + f(x_2) \right]
\]

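are the numerical integration techniques used within the 
program to compute the integral 

\[
\int_0^t
\frac{\sum_j \left[ Q_j(\tau) + P_j(\tau)\,C_{j\ell}(\tau) \right] \lambda_{j\ell}(\tau)}
     {e^{-\int_0^\tau \lambda_\ell(\eta)\,d\eta}}\,d\tau
\]

of the Kolmogorov equation. 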

The Trapezoidal rule is used to compute the integral 
from time 0 to time STEP, where STEP is the step size or Δx. 
Simpson's 1/3 rule is used to compute the remaining intervals, 
as shown in Figure 3.6. (The subroutines associated 
with these numerical techniques are called TRAPINT and SIMPINT.) 
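
One way to combine the two rules into a running integral is sketched below. The scheme is an assumption, not taken from TRAPINT or SIMPINT: the value at the second time point comes from the trapezoidal rule over the first interval, and each later value adds a Simpson's 1/3 panel, spanning the two intervals that end at that point, to the value two points earlier. The routine name running_integral is hypothetical.

```fortran
! Sketch: running integral with a trapezoidal first step and Simpson panels.
module step_integrals
  implicit none
contains
  ! f holds samples spaced step apart; q(it) is the integral from the
  ! first sample point up to sample it, with q(1) = 0.
  pure subroutine running_integral(f, step, q)
    real, intent(in) :: f(:), step
    real, intent(out) :: q(:)
    integer :: it
    q(1) = 0.0
    if (size(f) >= 2) q(2) = 0.5 * step * (f(1) + f(2))
    do it = 3, size(f)
      q(it) = q(it - 2) + step / 3.0 * (f(it - 2) + 4.0 * f(it - 1) + f(it))
    end do
  end subroutine running_integral
end module step_integrals

program demo_integrals
  use step_integrals
  implicit none
  integer, parameter :: n = 5
  real, parameter :: step = 0.5
  real :: f(n), q(n)
  integer :: it
  do it = 1, n
    f(it) = (step * real(it - 1))**2      ! samples of f(x) = x**2
  end do
  call running_integral(f, step, q)
  print '(5f10.6)', q
end program demo_integrals
```

For f(x) = x**2 the odd-numbered points reproduce x**3/3 to rounding, because Simpson's rule is exact for quadratics, while the even-numbered points carry the trapezoidal error of the first interval.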
INTEGRATION METHODS 
Figure 3.6 



3.4.5 MACRO FLOW CHART OF SYSTEM UNRELIABILITY MODEL 

The following macro flow chart shows the organization of 
the entire basic model which computes the system unreliability. 
The loop structure computing the vectors in sets is shown in 
relationship to the subroutines TDEPEND, SUMMAT, TRAPINT and 
SIMPINT. 






CARE 3 MACRO FLOW CHART 

[Flow chart not recoverable from the source. Legible boxes: 
loops to compute state vector (I, J, K); loop to compute time 
steps 1 to ITSTPS (IT = 1, IT = IT+1, test IT < ITSTPS); 
SUMMAT, compute the sum over j = 0 to l-1; compute P* (perfect 
coverage); TRAPINT, compute QLT(N,2) using the trapezoidal 
rule; SIMPINT, compute QLT(N,IT) using Simpson's 1/3 rule; 
print probabilities for vector.] 



4.0 CARE III PROGRAM STRUCTURE 


An implementation of a Modularized Direct Access 
Information System is the proposed structure for the CARE III 
system. The system will consist of three main modules: 

a. Batch or interactive input processor: 
CAREINB or CAREINI 

b. Coverage model: COVRGE 

c. Reliability model: CARE 3 

The following flow diagrams depict the proposed design 
of the CARE III system. 

Two text input files are required: one to define the 
computer configuration and one to aid in the calculation of 
the coverage model. If coverage is preset per stage in the 
configuration file INFILE, the coverage input file CVFILE 
need not be defined by the user. 

The Direct Access Information System (DAIS) files generated 
by CARE III are designed to be random, word addressable mass 
storage files. Each record within these files can be 
accessed with a master index or subindex(es). The DAIS 
files will contain the processed user input required by 
programs COVRGE and CARE 3. They will be made permanent disk 
files by CARE III so that they can be modified if desired 
without having to reinput the entire data set. Thus a second 
run can use existing files CAREDF and CARECV, after minor 
modifications have been made to them, by running program 
CAREIN using only an updated portion of the original input. 
This capability is especially convenient if the user runs the 
interactive CAREIN program. 
The DAIS files are to be created and accessed through 
the use of FORTRAN Mass Storage Input/Output (MSIO) 
subroutines OPENMS, WRITMS, READMS and CLOSMS. Record Manager 
word addressable file organization is used to implement these 
files. 
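
The indexed access pattern described above can be illustrated with standard Fortran direct-access I/O, used here as a stand-in for the MSIO routines rather than a reproduction of them. The record layout (four reals per record), the scratch file and the name open_store are assumptions made only for this sketch.

```fortran
! Sketch: record-addressable file using standard direct-access I/O
! (a substitute for OPENMS/WRITMS/READMS/CLOSMS, not their interface).
module dais_sketch
  implicit none
contains
  ! Open a scratch direct-access file holding records of nwords reals.
  subroutine open_store(unit_no, nwords)
    integer, intent(out) :: unit_no
    integer, intent(in) :: nwords
    real :: template(nwords)
    integer :: rec_len
    template = 0.0
    inquire(iolength=rec_len) template   ! record length for one record
    open(newunit=unit_no, status='scratch', access='direct', &
         form='unformatted', recl=rec_len)
  end subroutine open_store
end module dais_sketch

program demo_dais
  use dais_sketch
  implicit none
  integer, parameter :: nwords = 4
  real :: rec(nwords)
  integer :: unit_no, irec
  call open_store(unit_no, nwords)
  do irec = 1, 3                      ! write records 1 to 3
    rec = real(irec)
    write(unit_no, rec=irec) rec
  end do
  rec = 20.0
  write(unit_no, rec=2) rec           ! update record 2 in place
  read(unit_no, rec=2) rec            ! read back by record index
  print *, 'record 2:', rec
  read(unit_no, rec=1) rec
  print *, 'record 1:', rec
  close(unit_no)
end program demo_dais
```

Rewriting a single record without touching the others mirrors the stated goal of modifying the stored input without re-entering the entire data set; a permanent file would use a file name in place of the scratch status.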


In the following flow diagrams, a symbol containing a routine 
name and the function it performs denotes a separate routine 
for which a separate flow diagram exists in the pages 
following. For a more detailed look at the proposed system, 
see the CARE III Computer Program Requirements Document. 